10  ggplot 1: plot types (geometries)

In this chapter we will put the focus on plot types. We won’t worry too much about the layout, the scales or the choice of color. The basic plot types are also referred to as geometries. Before we cover these, we will first introduce ggplot and show how you can specify the aesthetics.

10.1 The layered grammar of graphics: ggplot

{ggplot2} is one of the most widely used packages to create data visualizations in R. The “gg” stands for the grammar of graphics: {ggplot2} is an implementation of that grammar. It does so in a layered way, building a plot layer by layer. To add a layer, you use a +, not a |>. The + nicely shows that you add one layer on top of the other.

The overall approach is straightforward. You first select the dataset and possibly the aesthetic mappings. The aesthetic mappings are included in mapping = aes(). Here you include the aesthetics that will be used by all layers: subsequent layers inherit these mappings. You then add one or more layers with the geometries. In {ggplot2} these are referred to as geom_type, e.g. geom_point, geom_line, … . Within these geometries, you can add further aesthetic mappings. However, these mappings are only relevant for that specific geometry.

After the geometries, you define the scales. Depending on the variable mapped on an aesthetic, these scales are either continuous or discrete. For each scale, there are default values: if you don’t include that layer, {ggplot2} will produce a graph using default values for all scale parameters. Often, these default values produce elegant graphs. To change them, you use scale_xxx_continuous or scale_xxx_discrete where xxx refers to the aesthetic, e.g. x, y, color, fill, … . For most scales, you also have the option to set them manually using scale_xxx_manual. In addition, you can transform scales, e.g. using a log, reverse or reciprocal transformation. Using these functions, you can also set limits or define labels for the axes.

The statistics are usually included in the geometry layer. For instance, a bar chart by default shows the number of observations per group, a density plot shows the estimated density, … . You can also specify the coordinate system (Cartesian, polar, …). Here, you can usually include the minimum and maximum values of e.g. the horizontal and vertical axis. The last layer adds the theme. This layer will be covered in Chapter 11 but, in short, it allows you to specify all components that are part of the non-data part of a plot. In this chapter we will use the standard theme theme_minimal(). This theme removes all background annotations.

Mastering the technicalities of data visualization is one thing; designing good quality visualizations is another. Here we’ll focus on the former and cover how you can use {ggplot2} to create visualizations. For the latter you can use e.g. Claus O. Wilke’s [Fundamentals of Data Visualization] (https://clauswilke.com/dataviz/) or the [BBC Visual and Data Journalism cookbook for R graphics] (https://bbc.github.io/rcookbook/#how_to_create_bbc_style_graphics). The BBC also published its {[bbplot] (https://github.com/bbc/bbplot)} package, which implements the BBC style guide.

Recall that you can save a plot: R saves the plot as a list (see Chapter 4). Here, I will use this to add layers. For instance, suppose that there is a graph developed with ggplot(diamonds, aes(carat, price, color = cut)) + geom_point(). To illustrate the use of the function scale_x_continuous(), we can save the first part, e.g. as p1 for plot 1, and use p1 + scale_x_continuous(). This is equivalent to ggplot(diamonds, aes(carat, price, color = cut)) + geom_point() + scale_x_continuous().

Before we start, we need to load some packages:
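A minimal set that covers the code in this chapter could look as follows. The exact list is an assumption derived from the functions used below: {ggplot2} for the plots and the diamonds dataset, {dplyr} for slice_sample(), filter() and summarise(). {readr} and {here} are called with their package prefix (readr::, here::), so they only need to be installed.

```r
# Assumed package list, derived from the functions used in this chapter
library(ggplot2)  # ggplot(), geom_point(), ggsave() and the diamonds dataset
library(dplyr)    # slice_sample(), filter(), summarise()
```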

10.2 First things first: saving a plot

In this and the next chapter, we will design plots. At the end of your work, you will probably want to save your visualization to use in a PowerPoint presentation, add to a paper, … . There are many ways to do so, e.g. add your plot to a PowerPoint file from within R, but often it is a good idea to save the plot. To do so, you first assign the plot to a name using plotname <- ggplot() .... R will not show the output but you’ll see plotname in your environment as a list. Using ggsave() you can now save this plot. This function has the following arguments:

ggsave(
  filename,
  plot = last_plot(),
  device = NULL,
  path = NULL,
  scale = 1,
  width = NA,
  height = NA,
  units = c("in", "cm", "mm", "px"),
  dpi = 300,
  limitsize = TRUE,
  bg = NULL,
  create.dir = FALSE,
  ...
)

The filename refers to the name of the file where you will store your plot. You can save a plot as a “png”, “eps”, “ps”, “pdf”, “tex”, “jpeg”, “tiff”, “bmp”, “svg” or “wmf” file. By default, R will save the last plot. Here, you can add the name of your plot, e.g. plot = plotname. Using {here} you can set the path where the file will be stored. The device allows you to specify the device that will be used to turn your plot into a “png”, “eps”, … file. If the name of the file includes an extension, the device will be the device for that extension. If you specify a device, you can add further arguments for the [device function] (https://rdrr.io/r/grDevices/unix/png.html). You can also add the path using {here} in the filename. The other parameters determine the size of your plot. By default R sets the width and height using your current graphics device. DPI is the resolution of the plot. By default R uses 300 dpi, which is suitable for most printers. For web pages, you can reduce the resolution; for high quality print, you may need to set this to 600. If your directory doesn’t exist, changing create.dir to TRUE allows R to create it.

Usually, you can save a plot using

ggsave(here::here("reports", "name_of_plot_file.png"), plot = plotname)

If this doesn’t work on your system, try another format. There might be some differences between Windows, Linux and Mac. If you keep running into issues after saving the file in several different formats, you’ll usually find a solution quite fast on e.g. Stack Overflow. Note that sometimes opening the plot by clicking on the name in the R Files pane will not show a good result. To see the end result, open the file outside of R or import it in e.g. your word processor or presentation software. Further, if you happen to use LaTeX (pronounced: latech), e.g. using Overleaf, you can save a plot as a tex file and import it into your report, presentation or paper.
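As a minimal sketch of the whole workflow, the following builds a plot, assigns it to a name and saves it with an explicit size. The object names and the temporary output file are assumptions made for this illustration; in a project you would use a path such as the one built with here::here() above.

```r
library(ggplot2)

# Build a plot on the first 500 diamonds and assign it to a name;
# nothing is drawn at this point
plotname <- ggplot(diamonds[1:500, ], aes(x = carat, y = price)) +
  geom_point(alpha = 1/5) +
  theme_minimal()

# Hypothetical output location: a temporary file, so the sketch runs anywhere
out_file <- tempfile(pattern = "name_of_plot_file_", fileext = ".png")

# The ".png" extension selects the device; width and height are set explicitly
ggsave(out_file, plot = plotname, width = 16, height = 10, units = "cm", dpi = 300)

# Check that the file was written
file.exists(out_file)
```

Setting width, height and units explicitly makes the result independent of the size of the current graphics device.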

10.3 The data and aesthetic mappings

We will use various datasets to illustrate the visualizations. In addition to the data in {nycflights23} and life expectancy at birth from the World Bank’s World Development Indicators database (Chapter 2), we will also use the diamonds dataset which comes with your {ggplot2} installation. This dataset includes the price of 53940 diamonds, measured in USD, together with their attributes: carat, cut, color and clarity. The attribute carat refers to the weight of the diamond (with 1 carat or ct equal to 0.2 grams). The cut measures the quality of the cut (fair, good, very good, premium or ideal). The cut determines the brightness of a diamond, the dispersion of light and the sparkle. The color is graded from D (best) to J (worst) and shows how colorless (best) a diamond is. The clarity shows how clear a diamond is and is measured from IF (best) over VVS1, VVS2, VS1, VS2, SI1 and SI2 to I1 (worst). The size of the diamond is measured as the length (x), width (y) and depth or height of the diamond (z). The dataset also includes the table, which measures the width of the top of the diamond relative to its widest point, and the depth, which measures the height of the diamond relative to the average of its length and width. You can find more information on these measures online. Because it is often not necessary to show all the data in the diamonds dataset, we’ll often use a sample of 10%. The data for life expectancy is in the data > raw folder in a file life_df.csv. You can import that file and assign it to life_df.

life_df <- readr::read_csv(here::here("data", "raw", "life_df.csv"), show_col_types = FALSE)

To build a plot, ggplot() needs to know “what” to plot. In other words, you need to tell R where the data is (in which data frame) and which variables in that data frame will be used in the visualization. This is the first layer of a plot. To add this first layer, you use the function ggplot(). The arguments of this function are:

ggplot(data = NULL, mapping = aes())

The argument data refers to the default data frame ggplot() uses to search for the variables. Recall that {ggplot2} is part of the {tidyverse}. In other words, you can use the pipe operator |> to “pipe” a dataset into the ggplot() function. If the object in data is not a data frame or tibble, R will try to convert the object into a data frame. If the argument is missing, each geometry has to specify the data frame to use in that geometry. The second argument mapping = aes(x = , y = , color = , size = , shape = , fill = , linewidth = , linetype = ) includes the aesthetic mappings for the plot. This argument too is optional. If this mapping is not specified here, each geometry will need its own aesthetic mapping. You can partially map variables at this level and add aesthetic mappings in the geometry layer. For instance, here you would map a variable on the x- and y-axis and add a mapping on the color aesthetic in e.g. the point geometry. In that case, the mapping on the x- and y-axis will be used for all geometries in the plot while the color aesthetic will only be used for the point geometry.
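The split between plot-level and layer-level mappings can be inspected on a saved plot. The sketch below is an illustration only: the object name p and the added smoothing layer are assumptions, and the plot is built but not drawn.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +  # x and y: shared by all layers
  geom_point(aes(color = cut)) +                    # color: point layer only
  geom_smooth(method = "lm", se = FALSE)            # uses the inherited x and y

# Mappings defined in ggplot(): inherited by every layer
names(p$mapping)

# Extra mapping of the point layer (ggplot2 stores color as "colour")
names(p$layers[[1]]$mapping)
```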

Let’s use a sample of 10% of the diamonds dataset

dia <- diamonds |> slice_sample(prop = 0.10) 

and start a plot where we map carat on the horizontal axis and price on the vertical axis:

p1 <- ggplot(data = dia, mapping = aes(x = carat, y = price)) + theme_minimal()
p1

This function returns the first layer of the plot: it shows the panel of the plot and the x- and y-axis. In addition, it shows the variables that were mapped on both axes: carat on the horizontal axis and price on the vertical axis. The limits of these axes (the minimum and maximum value of each axis) are derived from the data. You can see this from the minimum and maximum values for carat and price:

dia |> summarise(
  min_carat = min(carat),
  max_carat = max(carat), 
  min_price = min(price),
  max_price = max(price)
)
# A tibble: 1 × 4
  min_carat max_carat min_price max_price
      <dbl>     <dbl>     <int>     <int>
1       0.2       3.5       336     18804

The part aes(x = carat, y = price) includes the aesthetic mapping. To see this, run this part separately and R returns the mapping:

aes(x = carat, y = price)
Aesthetic mapping: 
* `x` -> `carat`
* `y` -> `price`

Because the aesthetic mapping in ggplot() is second after the data, the part mapping = aes() is usually shortened by eliminating the reference to the argument mapping and written as aes(). Similarly, because the data argument is always first, data = is usually dropped and the function call is either ggplot(data, aes()) or data |> ggplot(aes()). In the code, we add theme_minimal() to remove the background color from the plot and the panel.
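To check that these notations are equivalent, the sketch below writes the same first layer in three ways and compares the stored mappings. The object names are chosen for this illustration only.

```r
library(ggplot2)

# Three ways to write the same first layer
p_full  <- ggplot(data = diamonds, mapping = aes(x = carat, y = price))
p_short <- ggplot(diamonds, aes(carat, price))
p_pipe  <- diamonds |> ggplot(aes(carat, price))

# aes() assigns its first two unnamed arguments to x and y, in that order
names(p_full$mapping)
names(p_short$mapping)
names(p_pipe$mapping)
```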

At this stage, R is not able to add more aesthetic mappings to the output. All R knows at this stage is that the plot will include variable mappings on the x-axis and the y-axis. However, it is not able to show other aesthetic mappings, e.g. on color, size or shape. Here, R needs additional information from the geometry. With a point geometry, ggplot() will show these aesthetic mappings using a different color, size or shape of a point; with a line geometry, R will show these additional mappings by differentiating the lines using their color, width or type. However, if these aesthetic mappings are defined at this stage, R will include them in any subsequent geometry. In the example below, the output from aes() shows that cut is also mapped on the color aesthetic.

ggplot(dia, aes(x = carat, y = price, color = cut, size = clarity, shape = color))
Warning: Using shapes for an ordinal variable is not advised
Warning: The shape palette can deal with a maximum of 6 discrete values because more
than 6 becomes difficult to discriminate
ℹ you have requested 7 values. Consider specifying shapes manually if you need
  that many have them.

aes(x = carat, y = price, color = cut, size = clarity, shape = color)
Aesthetic mapping: 
* `x`      -> `carat`
* `y`      -> `price`
* `colour` -> `cut`
* `size`   -> `clarity`
* `shape`  -> `color`

You can map the same variable on two or more aesthetics. For instance, mapping cut on size and shape will show every level of cut with a different size and a different shape.

10.4 Geometry

The first layer includes the data and the aesthetic mappings that will be used for all geometries in the plot (unless another aesthetic is specified). At this stage, we know which variable R will show using which aesthetics. We also know where R will find these variables: in the data frame included in the data argument of the ggplot() function. We don’t know how these variables will be shown. To add this component, we need an additional layer: the one that includes the geometry and the statistics. In other words, we need a plot type. {ggplot2} includes many geometries or types of plots; to view all these possibilities, I refer to the geometry section in the {ggplot2} documentation and to Chang (2025). Here, you’ll find the often used geometries: point geometries (e.g. scatter plots), line geometries (a line graph), area geometries, bar and column charts and geometries that summarize the data using e.g. a boxplot or a density.

Note that there are many ways to visualize the same data. The same aesthetics (x- and y-axis, color, fill, line type or width, shape, size or alpha) can be used with many geometries. However, not all geometries allow for the same aesthetics. A point geometry, where you use a “dot” to show combinations of values for the variables mapped on both axes, allows you to include color, shape, size or transparency. However, line width or type are not relevant. For a bar chart, you can include fill, color and transparency, but shape would be irrelevant.

Second, every geometry is a layer. This allows you to add geometries on top of one another: you can add a line geometry to a point geometry or a text geometry. Doing so allows you to build complex data visualizations. Here, we will not focus too much on how you can do so, but once you master the individual geometries, adding additional ones is straightforward. Usually, you can show the data in numerous ways. To illustrate, here are [100 visualizations] (https://100.datavizproject.com/) of the same dataset: a small table showing the number of World Heritage sites in Denmark, Norway and Sweden for 2004 and 2022. In other words, there are 100 ways to visualize data in a table with 2 rows and 6 columns. Selecting the best geometry to visualize your data is very important. There are many guides to help you select the appropriate geometry, e.g. [from Data to Viz] (https://www.data-to-viz.com/), [Visual Vocabulary] (https://ft-interactive.github.io/visual-vocabulary/) or the guide from the [UK’s Office for National Statistics] (https://service-manual.ons.gov.uk/data-visualisation/chart-types). If you add various geometries as separate layers, you have to think about the order in which they are shown: a line geometry followed by a point geometry will show the points on top of the lines.
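The effect of the layer order can be sketched with a small made-up data frame. The data and object names below are assumptions for this illustration; printing either plot shows which geometry ends up on top.

```r
library(ggplot2)

# Small made-up data frame, for illustration only
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 4))

# Layers are drawn in the order in which they are added
lines_first <- ggplot(toy, aes(x, y)) +
  geom_line(linewidth = 2) +
  geom_point(color = "red", size = 4)    # points drawn last: on top of the line

points_first <- ggplot(toy, aes(x, y)) +
  geom_point(color = "red", size = 4) +
  geom_line(linewidth = 2)               # line drawn last: partly covers the points

# The stored layer order; the last layer is drawn on top
sapply(lines_first$layers,  function(l) class(l$geom)[1])
sapply(points_first$layers, function(l) class(l$geom)[1])
```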

10.4.0.1 Point geometries

A point geometry is used to show the correlation between two numeric, continuous variables. The first variable is mapped on the horizontal axis, the second on the vertical axis. These two, x and y, are required aesthetics. For every observation, a point geometry shows the pair of values for the variable mapped on the x-axis and the variable mapped on the y-axis as a single dot. Note that you shouldn’t interpret “dot” in a literal sense as you can change that representation and use e.g. a cross, … .

10.4.0.1.1 geom_point()

Most geometries share a large number of arguments. So, we will discuss this geometry in detail. Doing so, when we introduce other geometries we can focus on those arguments that are particular for a geometry.

10.4.0.1.1.1 The function

To illustrate point geometries, let’s use the point geometry geom_point(). The function includes the following arguments:

geom_point(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  ...,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)

The arguments for geom_point() include an aesthetic mapping argument mapping, a data argument data, a stat argument with default identity and a position argument with default identity. Recall from Chapter 9 that one of the elements of the grammar of graphics refers to the statistics. Here stat = "identity" means that R will plot the values of the series: the “statistics” R plots are identical to the values in the dataset. For other geometries, we’ll see for instance stat = "count". There, R will not plot the values, but will “count” the number of observations. From position = "identity", you can see that R will also plot the values on their exact location. In other words, a pair (1.5, 2.5) will be shown in the location that corresponds with 1.5 on the horizontal axis and 2.5 on the vertical axis. The exact location is where the line starting in 1.5 on the x-axis and moving up crosses the line starting in 2.5 on the y-axis and moving right. We’ll see other position values, e.g. “jitter”, where R adds a random component to both the x and y value, or “stack”, telling R to “stack” the values on top of one another, … .
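The difference between stat = "identity" and stat = "count" can be seen in the data that {ggplot2} computes for a layer, which layer_data() returns. The sketch below uses a small made-up data frame; the names toy, pts and bars are assumptions for this illustration.

```r
library(ggplot2)

# Made-up data: two identical points in group "a", one point in group "b"
toy <- data.frame(x = c(1.5, 1.5, 3), y = c(2.5, 2.5, 1), grp = c("a", "a", "b"))

# stat = "identity" (default of geom_point): the values are plotted as they are
pts <- layer_data(ggplot(toy, aes(x, y)) + geom_point())
pts[, c("x", "y")]

# stat = "count" (default of geom_bar): the number of observations per group
bars <- layer_data(ggplot(toy, aes(grp)) + geom_bar())
bars$count
```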

The value NULL for the first two arguments means that R will use the default data and aesthetic mappings included in the ggplot() call. geom_point() needs at least two mappings: one on the x- and one on the y-axis. If these mappings are not included in the ggplot() call, you have to add them here. You can also use the mapping argument in this geometry to add additional aesthetic mappings. The inherit.aes = TRUE argument shows that R will combine the mappings in mapping with the aesthetic mappings from the ggplot(aes()) call. Changing this into FALSE removes the inherited aesthetic mappings for that layer: the point geometry then only uses the mappings included in its own mapping argument, so you need to add all required aesthetic mappings to this geometry. The stat argument allows you to define transformations. The default identity shows the data as they are. The position argument with default identity shows the points as they are in the data: every “dot” is shown in the panel exactly on the spot of its value pair. However, in some cases, one point actually masks two or more points with the same value pair. With position = position_jitter(), R will add a random variation to every point to the left or right and to the top or bottom. You can define the maximum width and height of that random variation using the arguments width = and height =. A third option for position = is stack. We’ll cover that position more in depth when we discuss the area geometry geom_area().

In addition to these arguments, you can further specify the way in which R will plot the points, e.g. their color, fill, shape, size, stroke and alpha. You can further add the option na.rm = TRUE. In that case, ggplot() will remove missing values without a warning. The default FALSE shows a warning. By default, ggplot() will add new mappings in this layer to the guide or legend. If you don’t want this, you need to set show.legend = FALSE. The ... part allows you to define the settings: for instance, the color, size, shape or transparency of a “dot”. As we’ll see, you shouldn’t confuse a setting with a mapping. Both use the same reference: color, size, shape, linewidth, fill, … However, a setting such as color is part of the layout of a graph. It tells R which color to use to show a dot. This color can be red, blue, yellow, grey or any other color that is available. R will show all dots using the same color. Using the aesthetic color, you map a variable on that aesthetic. Here R will show a different color for every value of the variable that you map on the aesthetic color. In other words, with a mapping the color has a meaning: different colors show different values. As a setting, a color doesn’t offer any additional information with respect to the data as all dots are shown in the same color.

Let’s see what these arguments do. We start from ggplot(data = dia, mapping = aes(x = carat, y = price)) and add the geom_point() layer, accepting all defaults:

p1 + geom_point()

Recall that p1 didn’t include the aesthetic mapping on color. In other words, this point geometry only includes the aesthetic mappings on the horizontal and vertical axis. By default, R uses black dots to represent each (carat, price) pair. Every dot shows one observation. The shape of the cloud illustrates the correlation between the variable on the horizontal and the one on the vertical axis. Figure 10.1 shows six patterns in six panels. The first panel shows no correlation: the point cloud does not show any pattern. The second panel includes a pattern where the two variables are correlated, but not in a linear way. A traditional measure of (linear) correlation would suggest the absence of any correlation between variable 1 and variable 2. However, as panel 2 reveals, the correlation is actually strong, but not linear. Panels 3 to 6 show clouds that suggest weak positive correlation (panel 3), weak negative correlation (panel 4) and strong positive (panel 5) and negative (panel 6) correlation. These panels all show a point geometry. In other words, using a point geometry allows you to detect a pattern in the correlation between two variables, even if that correlation is not linear.

Figure 10.1: Correlation patterns
10.4.0.1.1.2 Changing the settings

Before we add additional aesthetic mappings, let’s first review what the options are for changing the settings of this plot. The options for the settings are usually also available for the mappings. Recall that settings don’t add any new information to the plot and only change the way the plot is shown. By default, R shows the dots with a black color. You can change the color of the points into e.g. red. You can do so by adding color = "red" to geom_point():

p1 + geom_point(color = "red")

As an alternative, you can use “lightyellow”:

p1 + geom_point(color = "lightyellow")

Here, we identify the colors by their name. As an alternative to the name you can also add a color’s HEX code. For instance, the color “lightsteelblue” has HEX code “#B0C4DE”. You can enter this HEX code to define the color:

p1 + geom_point(color = "#B0C4DE")

Here too, R shows the color. The hex code and name are equivalent:

p1 + geom_point(color = "lightsteelblue")
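If you are unsure about the HEX code that belongs to a color name, base R can do the lookup. The sketch below uses col2rgb() and rgb() from {grDevices}; the object names are chosen for this illustration.

```r
# Convert a color name to its red, green and blue components (0-255)
rgb_values <- col2rgb("lightsteelblue")
rgb_values

# Turn these components into a HEX code
hex_code <- rgb(rgb_values["red", 1], rgb_values["green", 1], rgb_values["blue", 1],
                maxColorValue = 255)
hex_code

# Color names known to R that contain "steelblue"
grep("steelblue", colors(), value = TRUE)
```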

To find a color, you can use online color pickers. These usually allow you to identify a color on a wheel or picture and return various color codes: HEX, RGB, … If you copy and paste the HEX code, R will show exactly that color. In addition, most of these color pickers allow you to generate complement colors, shades, … Here for instance, you can find HEX codes. Using their detail, you can find complements, … .

Using alpha, you can change the transparency of a point. This value is between 0 (absolute transparency) and 1 (no transparency). To illustrate, using the color red and transparency 1/5:

p1 + geom_point(color = "red", alpha = 1/5)

The plot now shows for which values the dataset includes a lot of observations. As these single, transparent dots overlap, they produce a more intense color. Here, you can see that the dataset includes a lot of observations for diamonds with a lighter weight and lower price.

To adjust the size of the dots, you can use the size setting. For instance, setting the size = 5 shows much larger dots:

p1 + geom_point(color = "orange", size = 5)

Using size = 0.75 shows much smaller dots:

p1 + geom_point(color = "orange", size = 0.75)

In addition to the color and size, you can also change the shape using shape =. To identify a shape, you can refer to its number (shown in Figure 10.2) or name (shown in Figure 10.3).

Figure 10.2: Shapes by number
Figure 10.3: Shapes by name

Let’s show the “dots” with a cross (shape 4):

p1 + geom_point(shape = 4)

Note that the default color of the cross is black. Combining both the color (e.g. lightsteelblue) and shape setting:

p1 + geom_point(color = "#B0C4DE", shape = 4)

Shapes 21 to 24 can be further adjusted using the color, fill and stroke settings. The first, color, sets the color of the border, the second, fill, the color of the interior and stroke the width of the border. If you add size, this setting controls the size of the interior part. The total size of the shape is the sum of the interior size and the stroke. You can see this in Figure 10.4. For instance, to draw a diamond (shape 23) with an orange border with stroke 2 and a light steel blue interior with size 4:

p1 + geom_point(shape = 23, color = "orange", fill = "#B0C4DE", size = 4, stroke = 2)

Figure 10.4: Size and stroke

In addition to the shapes that are shown in Figure 10.2, you can add Unicode shapes. All characters, shapes, symbols, … have a Unicode “code”. You can find all codes for [geometric shapes] (https://www.w3schools.com/charsets/ref_utf_geometric.asp). Suppose you want to use a white diamond containing a small black diamond as a shape. The Unicode code is “25C8”. You can add this code using “\u25C8”:

p1 + geom_point(shape = "\u25C8", color = "blue", size = 2)

The list of Unicode symbols includes math symbols, currency symbols, weather symbols, emojis, … .
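The code of a symbol is a hexadecimal number. As a small sketch in base R, you can convert between the code and the character with intToUtf8() and utf8ToInt(); the object name symbol is chosen for this illustration.

```r
# "25C8" is a hexadecimal code point; intToUtf8() turns it into the character
symbol <- intToUtf8(0x25C8)

# The "\u" escape yields the same character
identical(symbol, "\u25C8")

# And back: the code point of the character, printed as a hexadecimal number
format(as.hexmode(utf8ToInt(symbol)), upper.case = TRUE)
```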

Settings can be used to highlight specific values. Suppose for instance that you want to highlight a specific level of cut (e.g. “Premium”) in the price - carat plot. To do so, we first draw the plot and use the color setting to show all points in light grey. You can then add a second point geometry, but use a filtered dataset. To do so, you introduce a new dataset in this second point geometry using filter(dia, cut == "Premium"). Recall from Chapter 8 that this function returns a dataset. The second point geometry inherits the x and y mappings from the ggplot() call. In other words, the second point geometry maps carat and price on the x-axis and y-axis. However, this dataset only includes the data for cut == "Premium" diamonds. Adding a different color as a setting, these points will be shown in another color. For instance:

p1 +
  geom_point(color = "lightgrey") +
  geom_point(data = filter(dia, cut == "Premium"), color = "steelblue")

As we will see, you can now add an annotation:

p1 +
  geom_point(color = "lightgrey") +
  geom_point(data = filter(dia, cut == "Premium"), color = "steelblue") +
  annotate("text", x = 0.5, y = 15000, label = "Premium cut diamonds \n shown in color", color = "steelblue")

10.4.0.1.1.3 Aesthetic mappings

Adding aesthetic mappings, in contrast to settings, changes the information that is shown in the plot. You add these mappings in the aes() function. Doing so, ggplot() will show different colors, shapes, sizes or strokes for every value of the variable that was mapped on that aesthetic. To illustrate, let’s again start from p1 and map the variable cut on the aesthetic color. To do so, we add aes(color = cut) to the geom_point() function and keep all other default values:

p1 + geom_point(aes(color = cut))

The result now shows different colors per level of cut: diamonds whose cut is ideal are shown with a yellow dot, premium with a light green dot, … . By default, R adds a guide or legend which shows these values. Adding show.legend = FALSE would remove that legend. However, in that case, it would be difficult to interpret the color scale. Note how the plot shows each aesthetic mapping: the variable that was mapped on the horizontal axis is shown as a label under that axis, the variable that was mapped on the vertical axis is shown as a label next to the vertical axis and the variable that was mapped on the color aesthetic is shown in the legend.

At this stage, adding the aesthetic mapping in the geom_point() function or in the ggplot(aes()) function shows similar results:

ggplot(dia, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  theme_minimal()

This is due to the fact that all layers that follow the ggplot() function inherit the aesthetic mapping defined there. Without any aesthetic mapping in ggplot() you would have to add it in the geom_point() function:

ggplot(dia) +
  geom_point(aes(x = carat, y = price, color = cut)) +
  theme_minimal()

In that case, the mapping is only relevant for the point geometry.

What if you add an aesthetic mapping on color as well as a setting for color? In that case, R overrides the aesthetic mapping and shows all points in the color defined in the setting. For instance:

p1 + geom_point(aes(color = cut), color = "red")

In the result, you can see that R shows all dots in red. In addition, it removes the legend. You can however, specify other settings, e.g. if you map cut on the aesthetic color, you can still change the setting shape. R will now show every point using the same shape and will use various colors to show the values of the variable cut:

p1 + geom_point(aes(color = cut), shape = 6)

Here, R maps the variable cut on the aesthetic color and shows the “dots” using a triangle.
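The difference between a mapping, a setting and the combination of both can also be checked in the data that {ggplot2} computes for a layer. The sketch below uses layer_data() on a small made-up data frame; the names toy, base, mapped, set_only and both are assumptions for this illustration.

```r
library(ggplot2)

# Made-up data with a grouping variable, for illustration only
toy <- data.frame(x = 1:4, y = 1:4, grp = c("a", "a", "b", "b"))
base <- ggplot(toy, aes(x, y))

mapped   <- layer_data(base + geom_point(aes(color = grp)))                 # mapping
set_only <- layer_data(base + geom_point(color = "red"))                    # setting
both     <- layer_data(base + geom_point(aes(color = grp), color = "red"))  # both

unique(mapped$colour)    # one color per level of grp
unique(set_only$colour)  # a single color: the setting
unique(both$colour)      # a single color: the setting overrides the mapping
```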

In addition to the color aesthetic, geom_point() also accepts the fill, shape, size and stroke aesthetics. You can include more than one mapping in the aes() function. To illustrate, let’s add another aesthetic mapping and map the variable color in the diamonds dataset (not to be confused with the aesthetic in R; the variable and the aesthetic happen to have the same name) on the shape aesthetic:

p1 + geom_point(aes(color = cut, shape = color))
Warning: Using shapes for an ordinal variable is not advised
Warning: The shape palette can deal with a maximum of 6 discrete values because more
than 6 becomes difficult to discriminate
ℹ you have requested 7 values. Consider specifying shapes manually if you need
  that many have them.
Warning: Removed 283 rows containing missing values or values outside the scale range
(`geom_point()`).

Here, R produces a warning: the shape palette can deal with a maximum of 6 discrete values as more than 6 become difficult to differentiate. If you look at the legend, you’ll see that R didn’t include a shape for color “J”. In this case, you could change the mapping and map cut on shape and color on the aesthetic color:

p1 + geom_point(aes(color = color, shape = cut))
Warning: Using shapes for an ordinal variable is not advised

However, here too, R shows a warning. What R is essentially telling you here is that you are mapping an ordinal variable on an aesthetic that doesn’t allow you to show something in an ordered way: a cross is not better or worse than a dot. For ordinal variables, it is better to use another aesthetic that does allow you to show the order. Color, for instance, can be shown from light to dark and reflect an order in that way. The same holds for size. A better “cut” can be shown with a larger or smaller size. For instance, if you map color on size:

p1 + geom_point(aes(color = cut, size = color))

the plot shows color “D” using the smallest dot while “J” is shown using the largest dot. The shape aesthetic allows you to map nominal values. Here, the order is not relevant and R can show these values with different shapes. However, R will restrict the number of nominal values to 6.
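Both warnings follow from the way the variables are stored in the diamonds dataset. As a quick check, using base R functions only:

```r
library(ggplot2)  # provides the diamonds dataset

# cut and color are stored as ordered factors (ordinal variables)
is.ordered(diamonds$cut)
is.ordered(diamonds$color)

# color has 7 levels, one more than the default shape palette can show
levels(diamonds$color)
nlevels(diamonds$color)
```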

You can map the same variable on two aesthetics. In that case, R will change both aesthetics in line with the values of the variable that is mapped on both. For instance, if you map cut on both color and shape, ggplot() shows the different levels of the variable cut using both a different color and shape: “Fair” is shown with a blue dot, “Premium” with a light blue cross, … .

p1 + geom_point(aes(color = cut, shape = cut))
Warning: Using shapes for an ordinal variable is not advised

In Chapter 11, we see how you can change the colors, size and shape … in the mapping and show them in the color of your choice and not in the default color values.

10.4.0.1.1.4 Adding jitter

In some cases, points are on top of each other. Consider the following graph:

ggplot(dia, aes(x = clarity, y = price)) +
  geom_point() +
  theme_minimal()

Note that this graph doesn’t make much sense, but it nicely shows that in some cases the same (clarity, price) pairs are on top of each other. As you cannot see all the points that are actually on the graph, this graph does not represent the data very well. In that case, you can add a bit of jitter to each point. You can do so using the argument position =. By default the position is identity. This default tells ggplot() to draw the points in line with their (x-value, y-value) pair, even if there are points with 100% overlap. Using the jitter option, R adds random noise to each point. In other words, every dot is still shown using its (x-value, y-value) pair, but R adds a random noise term to both values: it actually shows (x-value + noise, y-value + noise). As two points with 100% overlap (i.e. the same (x-value, y-value)) each get a different random noise term, the plot will show them without that overlap. To illustrate:

ggplot(dia, aes(x = clarity, y = price)) +
  geom_point(position = "jitter") +
  theme_minimal()

The plot now shows more points than without the jitter. This might seem counter-intuitive, but plots with jitter actually show more information than the plot without (see Figure 10.5).

Figure 10.5: Adding jitter

You can control the “amount of jitter” using position_jitter(width = NULL, height = NULL, seed = NA). Here, you can set both the width and the height of the jitter. Because jitter is added in both the positive and the negative direction, the total spread is twice the amount in width and height. By default (also if you refer to "jitter"), both are 40% of the resolution of the data, i.e. 0.40 for a categorical variable. Doing so, the jitter occupies 80% of the width of the categorical variable. In Figure 10.5 for instance, you can see that the width of the clarity column in the plot with jitter is 80% of the total width of that column. Using a width of more than 0.5 doesn’t make sense: in that case one column would overlap with the next. Setting smaller values reduces the width of the column with jitter. For instance, using 0.25 for both:

ggplot(dia, aes(x = clarity, y = price)) +
  geom_point(position = position_jitter(width = 0.25, height = 0.25)) +
  theme_minimal()

Adding a seed value allows you to reproduce the graph with “the same” random jitter.

A third way to add jitter is to use the geometry geom_jitter() and not the geometry geom_point(). In addition to the arguments for the latter, the former also includes width and height from the position_jitter() function.
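
To see what width, height and seed do, the following sketch uses a small made-up data frame (toy, with the hypothetical variables grade and value) and inspects the jittered positions with layer_data(). It is only meant as an illustration under these assumptions:

```r
library(ggplot2)

# Hypothetical toy data: many identical (x, y) pairs
toy <- data.frame(
  grade = rep(c("a", "b", "c"), each = 20),
  value = rep(c(1, 2, 3), each = 20)
)

# Jitter only horizontally and fix the seed so the noise is reproducible
p_jit <- ggplot(toy, aes(x = grade, y = value)) +
  geom_point(position = position_jitter(width = 0.25, height = 0, seed = 42))

# Building the plot twice gives the same jittered positions
x_first  <- as.numeric(layer_data(p_jit)$x)
x_second <- as.numeric(layer_data(p_jit)$x)

# Offsets from the category centres (1, 2, 3) stay within +/- 0.25
offsets <- x_first - rep(1:3, each = 20)

# geom_jitter() with width and height is a shorthand for the same idea
p_short <- ggplot(toy, aes(x = grade, y = value)) +
  geom_jitter(width = 0.25, height = 0)
```

Because the seed is fixed, x_first and x_second are the same. Without a seed, every build of the plot draws new noise.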

10.4.0.1.2 geom_text() and geom_label()

A point (or another shape) is one of the ways to show data in a graph. Using text is a second. To illustrate, let’s use the life_df dataset. In case you haven’t imported it yet, you can do so using:

life_df <- read_csv(here::here("data", "raw", "life_df.csv"))

Recall that we have used this dataset in e.g. Figure 9.1 and Figure 9.2. We will remove some of the aesthetic mappings and scale attributes and start from this plot:

ggplot(filter(life_df, date == 2000), aes(x = gdp_capita, y = life_exp, color = region)) +
  geom_point() + 
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
   theme_minimal()
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_point()`).

As you can see, we kept the aesthetic mappings on the x- and y-axis and color. However, we dropped the aesthetic mapping on size. We also removed the labels from the axes.

life_df includes the name of the country, as well as its ISO3 code. Using geom_text() we can now plot these codes (or names). To do so, we add the aesthetic label in the geom_text() function and map the ISO3 code on that aesthetic. Here, the aesthetic label tells ggplot() where to find the values it needs to show instead of the “dots”. In this case, they are in the variable iso3c:

ggplot(filter(life_df, date == 2000), aes(x = gdp_capita, y = life_exp, color = region)) +
  geom_text(aes(label = iso3c)) + 
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
   theme_minimal()
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_text()`).

In the results, the “dots” are now replaced by the ISO3 code for each country. The aesthetic mapping on color now shows as a different text color for each region in the dataset. Note that you can also include the label mapping in the ggplot() function. However, given that the labels are usually very specific to this geometry, they are usually added there and not in the default aesthetics for the entire plot. The function includes a couple of other arguments that are worthwhile to note:

geom_text(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  ...,
  parse = FALSE,
  nudge_x = 0,
  nudge_y = 0,
  check_overlap = FALSE,
  size.unit = "mm",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)

Most should seem familiar by now. The mapping needs at least a mapping on the x- and y-axis as well as a mapping on the aesthetic label. You can include these mappings in the geom_text() function or define them for all layers in the plot in ggplot(). The options that are new allow you to position the text. The first is check_overlap. By default, this is FALSE. If you change that into TRUE, ggplot() does not draw a label that would overlap a label already drawn in the same layer. nudge_x and nudge_y allow you to add some distance between the text and another geometry. For instance, if your plot also includes geom_point(), then nudge_x adds some horizontal space and nudge_y some vertical space between the dot and the text. Size is measured in mm (size.unit = "mm"; alternatives are points "pt", centimetres "cm", inches "in" or picas "pc"). The ... allows you to add settings, e.g. font size, font family, color (in case that aesthetic is not used), angle, … . For instance, you can reduce the font size to 3, check for overlap, set the font family to “mono” or one of the other supported font families (serif, …), add a fontface (plain, bold or italic) or add an angle. To illustrate, let’s add the country’s ISO code to the life expectancy plot, using the font family “mono” in italics and with a small angle of 22.5°:

ggplot(filter(life_df, date == 2000), aes(x = gdp_capita, y = life_exp, color = region)) +
  geom_text(
    aes(label = iso3c), 
    size = 3, 
    check_overlap = TRUE, 
    family = "mono",
    fontface = "italic",
    angle = 22.5) + 
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
   theme_minimal()
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_text()`).

In addition, you can use string functions to generate text labels from the aesthetic mappings. For instance, to show the values as (x, y) pairs, you can use paste0("(", x, ",", y, ")") to show an opening bracket, the value from the variable mapped on the x-aesthetic, a “,”, the value mapped on the y-aesthetic and a closing bracket. For life_df, where per capita gdp is mapped on the x-axis and life expectancy at birth on the y-axis - and adding rounding - the code to generate a plot with (x, y) pairs is:

ggplot(filter(life_df, date == 2000), aes(x = gdp_capita, y = life_exp, color = region)) +
  geom_text(
    aes(label = paste0("(", round(gdp_capita, 0), ",", round(life_exp, 1), ")")), 
    size = 3, 
    check_overlap = TRUE, 
    family = "mono") +
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
   theme_minimal()
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_text()`).
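
To check what such a label looks like before you put it in a plot, you can build it outside ggplot(). A small sketch with made-up numbers standing in for gdp_capita and life_exp:

```r
# Hypothetical values standing in for gdp_capita and life_exp
gdp_capita <- c(1234.56, 20480.2)
life_exp   <- c(71.234, 79.87)

# Build "(x,y)" labels in the same way as inside aes(label = ...)
labels <- paste0("(", round(gdp_capita, 0), ",", round(life_exp, 1), ")")
labels
# [1] "(1235,71.2)"  "(20480,79.9)"
```

Because paste0() is vectorised, it returns one label per observation, which is exactly what the label aesthetic needs.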

As an alternative for geom_text() you can use geom_label(). Let’s check the result of this geometry:

ggplot(filter(life_df, date == 2000), aes(x = gdp_capita, y = life_exp, color = region)) +
  geom_label(aes(label = iso3c)) + 
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
   theme_minimal()
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_label()`).

By default, R shows the values of the variable mapped on the label aesthetic, but does so in a frame. The function includes the following arguments:

geom_label(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  ...,
  parse = FALSE,
  nudge_x = 0,
  nudge_y = 0,
  label.padding = unit(0.25, "lines"),
  label.r = unit(0.15, "lines"),
  label.size = 0.25,
  size.unit = "mm",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)

In addition to those for geom_text(), these arguments also allow you to adjust the border of the label: the amount of padding in each label (label.padding), the rounding of the corners (label.r) and the size of the label border (label.size). In addition, you can include settings to e.g. fill the label. You can do the latter in the aes() function. Doing so, ggplot() fills the labels with a color per level of the mapped variable. To illustrate a couple of options, we will show the labels where region is mapped on the fill aesthetic, the color of the font is white and the labels have straight corners. The font family is serif. Note that we moved region from the color aesthetic in the ggplot() call to the fill aesthetic in the geom_label() call.

ggplot(filter(life_df, date == 2000), aes(x = gdp_capita, y = life_exp)) +
  geom_label(
    aes(label = iso3c, fill = region), 
    color = "white", 
    family = "serif", 
    size = 3, 
    label.r = unit(0, "lines")) + 
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
   theme_minimal()
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_label()`).

You can use these geometries together with geom_point(). Doing so, the graph includes a dot for every observation as well as a label:

ggplot(filter(life_df, date == 2000), aes(x = gdp_capita, y = life_exp, color = region)) +
  geom_point() +
  geom_text(
    aes(label = iso3c), 
    check_overlap = TRUE, 
    nudge_x = 0.2, 
    nudge_y = 0.2, 
    size = 3) + 
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
   theme_minimal()
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 22 rows containing missing values or values outside the scale range
(`geom_text()`).

{ggrepel} (Slowikowski (2024)) allows you to add more details to labels and text parts of a plot.

10.4.0.1.3 geom_smooth()

Although this geometry is not a point geometry, it is often used as an additional layer with a point geometry as it helps in identifying trends in (overplotted) data. To smooth a dataset is to add a function that approximates the important patterns in that dataset leaving out random noise or other rapid changes in the data. The function’s arguments are:

geom_smooth(
  mapping = NULL,
  data = NULL,
  stat = "smooth",
  position = "identity",
  ...,
  method = NULL,
  formula = NULL,
  se = TRUE,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE
)

You can control more options through stat_smooth(). The aesthetic mapping needs at least a mapping on the x- and y-axis. Here, the statistic is smooth. With position = "identity", the smoothed line is drawn at the predicted values for the variable mapped on the y-axis. With respect to the method, you can accept the default values. An analysis of those methods is left for more advanced statistics or econometrics classes. The same holds for the formula argument. By default, geom_smooth() adds a 95% confidence interval. You can change this level using level = 0.90 for a 90% confidence level or 0.99 for a 99% confidence level. This argument is not a formal argument of geom_smooth(); it is passed on through ... to stat_smooth(). If you don’t want a confidence interval at all, you can change the default se = TRUE to FALSE. The ... also allows you to change settings, e.g. color, … .

Let’s return to the diamonds dataset where carat is mapped on the horizontal axis and price on the vertical axis (p1) and add geom_smooth().

p1 + geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The plot shows the relation between carat on the one hand and price on the other. The expected price, given a value for carat, is shown by the blue line. The uncertainty is shown as a confidence interval around that blue line. By default, the confidence level is 95%. R also shows the method used to calculate the smooth function. Let’s first change some of the settings, e.g. the confidence level (level =), the line width of the blue line (linewidth =) and the color of both the line (color) and the background of the confidence interval (fill).

p1 + 
  geom_smooth(
    level = 0.90, 
    linewidth = 0.75, 
    linetype = "solid", 
    color = "#F59247", 
    fill = "#F3Be96") +
  theme_minimal()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
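
To see that level really changes the confidence band, you can compare its width for two levels. The sketch below is an illustration only: it uses the built-in mtcars data and a linear fit instead of the diamonds data, and band_width() is a hypothetical helper name.

```r
library(ggplot2)

# Hypothetical helper: mean width of the confidence band for a given level
band_width <- function(level) {
  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_smooth(method = "lm", formula = y ~ x, level = level)
  ld <- layer_data(p)   # the values that geom_smooth() draws
  mean(ld$ymax - ld$ymin)
}

width_90 <- band_width(0.90)
width_99 <- band_width(0.99)
```

A higher confidence level gives a wider band, so width_99 should be larger than width_90.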

Often geom_smooth() is added as a layer in addition to geom_point(). This produces the following graph (where I add some transparency to the points in geom_point()):

p1 + 
  geom_point(alpha = 1/10) +
  geom_smooth() +
  theme_minimal()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Let’s see what happens if we add another aesthetic in the ggplot() call and map cut on the aesthetic color:

ggplot(dia, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 1/2) +
  geom_smooth() +
  theme_minimal()

The plot now shows 5 smoothed lines, one per level of cut. The lines are in the same color as the color chosen to map cut. Why did ggplot() return one smoothed function per cut? Recall that the plot defines the aesthetic mappings in the ggplot() call. All other layers inherit this mapping. For geom_smooth() this means that it will plot a smoothed line using the x- and y-values, but will do so for every group defined by the discrete variables in the mapping. If you also map clarity on the linewidth aesthetic, geom_smooth() returns one smoothed line for every combination of cut and clarity that occurs in the data: with 5 levels of cut and 8 levels of clarity, that can be up to 40 lines. geom_smooth() changes the width of the line to show the level of clarity. Even if we set se = FALSE to remove the confidence intervals, this plot is hardly interpretable.

ggplot(dia, aes(x = carat, y = price, color = cut, linewidth = clarity)) +
  geom_point(alpha = 1/2) +
  geom_smooth(se = FALSE) +
  theme_minimal()
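
A quick way to check how many lines geom_smooth() fits is to count the groups in the layer data. ggplot2 forms the groups from the interaction of all discrete variables mapped in the layer. The sketch below is an illustration under assumptions: it uses the built-in mtcars data, with cyl and am playing the roles of cut and clarity, and maps the second variable on linetype to avoid the warning for a discrete linewidth.

```r
library(ggplot2)

# mtcars stands in for the diamonds data; cyl and am are made discrete
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))

p_groups <- ggplot(cars, aes(x = wt, y = mpg, color = cyl, linetype = am)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Number of separately fitted lines versus number of observed combinations
n_lines  <- length(unique(layer_data(p_groups)$group))
n_combos <- nrow(unique(cars[c("cyl", "am")]))
```

Here n_lines equals n_combos: there is one line per observed combination of the two variables, not one line per level of each variable.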

To avoid this while keeping the option to add other aesthetic mappings in addition to x and y, there are various options. First, you don’t define any aesthetic mapping at the level of the plot, but keep these for the individual geometries. As mappings in one geometry are not inherited by the others, geom_smooth() will only smooth using the variables mapped in its own aesthetic mapping, e.g. only x and y. To illustrate:

ggplot(dia) +
  geom_point(aes(x = carat, y = price, color = cut)) +
  geom_smooth(aes(x = carat, y = price)) +
  theme_minimal()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Here, we map the carat and price variables on the x- and y-axis and cut on the color aesthetic in the geom_point() call. In geom_smooth() we don’t include the mapping on color. Doing so, geom_smooth() only uses the values mapped on the x- and y-axis to calculate the smoothed line. Note that this also allows you to add other aesthetics in geom_smooth(). For instance, you could draw a smoothed function per level of clarity while the points show the level of cut.

The second way to handle this is to include the aesthetic mappings that all layers have in common in the ggplot() call, while adding those that are specific for each layer to the geometry for that layer. In the example: you add the aesthetic mapping of cut on color in geom_point() while adding no additional mapping to geom_smooth(). As geom_point() inherits the mappings from the ggplot() function, it adds one additional mapping. geom_smooth() doesn’t add any other mapping. Doing so, it only uses the mappings that it inherits:

ggplot(dia, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +
  geom_smooth() +
  theme_minimal()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

There is one method and formula that is worth mentioning: a linear trend. The method to estimate a linear trend is lm, or linear model. This is the method you would use to estimate linear regression models. The formula is y ~ x. This too is the formula you would use for bivariate linear regression models. Adding both allows you to add a linear trend:

ggplot(dia, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  theme_minimal()
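
To convince yourself that this line is the same as the one from a linear regression, you can compare the values drawn by geom_smooth() with the predictions of lm(). A minimal sketch, using the built-in mtcars data instead of the diamonds data:

```r
library(ggplot2)

p_lm <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Values of the smoothed line as computed by ggplot2 (layer 2 is geom_smooth)
smooth_data <- layer_data(p_lm, 2)

# The same line from an explicit linear regression
fit <- lm(mpg ~ wt, data = mtcars)
by_hand <- predict(fit, newdata = data.frame(wt = smooth_data$x))

# Largest difference between both lines
max(abs(by_hand - smooth_data$y))
```

The difference should be zero up to rounding error: geom_smooth(method = "lm") fits the regression for you and draws its predictions.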

10.4.0.2 Line geometries

In the previous section, geom_smooth() resulted in a line; it is essentially a summary geometry that is shown as a line. Line geometries are ideal to show the evolution of a numeric (double) variable. Examples include a firm’s market capitalization, its sales or gross margin, or macro-economic data, where “evolution” is shown with a date/time variable mapped on the x-axis and the value of the series on the y-axis. Every (x, y) pair - e.g. (year, sales) - represents a “dot”. These dots are not shown, but they are connected using a line. There are three different line geometries: geom_line(), geom_step() and geom_path(). These three differ in the way they connect the data. The first connects the observations in the order of the variable mapped on the x-axis and uses a continuous line. For instance, if year is mapped on the x-axis, the first “point” on the line shows the value of the variable mapped on the y-axis for the earliest year, the second for the second year, … . The second uses the same order, but connects the “dots” with horizontal and vertical segments (a staircase). The last, geom_path(), connects the observations in the order in which they appear in the dataset. If the variable mapped on the x-axis is time and the dataset is sorted by time, geom_path() and geom_line() are equivalent. However, if the variable mapped on the x-axis is e.g. “gross margin” and the variable on the y-axis is “market capitalization”, geom_line() uses the observation with the lowest gross margin as the first “dot” and then connects it with the observation with the second lowest gross margin, … , while geom_path() follows the rows of the dataset, e.g. from the first year to the last. In addition, geom_vline(), geom_hline() and geom_abline() allow you to draw reference lines: a vertical line, a horizontal line or a line with a slope, each covering the full plot panel. Using geom_segment() and geom_curve() you can draw a line that connects two points within the panel.

In general, most of what was written with respect to the aesthetic mapping in the previous section for point geometries also applies to line geometries, and most differences are straightforward: you can define the line width in a line geometry, not in a point geometry; you can have a shape in a point geometry, but not in a line geometry. So, here we’ll focus on what is specific for these geometries.

You can think of a line geometry in terms of a point geometry. While the latter shows one dot for each observation, the former implicitly connects these dots using a line. While a point geometry is very useful to show correlation, a line geometry is very useful to show evolution.

To illustrate, we’ll use the life_df dataset. In case you haven’t imported it yet, you can do so using:

life_df <- read_csv(here::here("data", "raw", "life_df.csv"))

Recall that we have used this dataset in e.g. Figure 9.1 and Figure 9.2. There we used the dataset for a specific year. Here, we will use all data for Costa Rica and Brazil. We’ll also rename the variable date to year.

df_life_cribra <- filter(life_df, iso3c == "CRI" | iso3c == "BRA") |> rename(year = date)

We will start with geom_line() and geom_step().

10.4.0.2.1 geom_line() and geom_step()

geom_line() and geom_step() are usually used to show the evolution of one numeric variable where a date/time is mapped on the x-axis and the value is mapped on the y-axis. These two mappings, x and y, are required. You can map other variables on the color of the line, the line width and line type. In addition you can use values of other variables to group lines. The difference between geom_line() and geom_step() is the way they connect the dots. For geom_line() the arguments are:

geom_line(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)

Most arguments of the geom_line() function should be familiar. We’ll use the position = argument when we discuss geom_area(), where we will show how you can “stack” lines in a plot. The orientation = argument is only used if the orientation of the line is not straightforward from the data. The geom_step() function adds an argument that is related to the way this function draws a line: as a sequence of horizontal and vertical segments.

geom_step(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  direction = "hv",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)

Here, the function needs to know if it first has to move up or down (vertical) or first has to move right (horizontal). With the default direction = "hv", the line first moves right, then up or down; with “vh”, the first movement is up or down, then right. “mid” means that the step is taken halfway between two adjacent x-values.

Let’s illustrate these two functions for the dataset with life expectancy at birth for Brazil and Costa Rica:

df_life_cribra <- life_df |> filter(iso3c == "CRI" | iso3c == "BRA") |> rename(year = date)

We first show only one country (Costa Rica). Doing so allows us to keep the aesthetic mapping simple: we’ll map the variable year on the horizontal axis and life expectancy on the vertical axis. For all other arguments, we accept the default values.

df_life_cribra |> filter(iso3c == "CRI") |>
  ggplot(aes(x = year, y = life_exp)) +
  geom_line() +
  theme_minimal()

The plot shows the evolution of life expectancy at birth since 1960 for Costa Rica. R adds the variables mapped on the horizontal and vertical axis as labels. The default color for the line is black. Adding a point geometry, you can see that a line geometry is often an extension of a point geometry. However, because here we show an evolution, a line geometry works better. Adding dots doesn’t add any additional value.

df_life_cribra |> filter(iso3c == "CRI") |>
  ggplot(aes(x = year, y = life_exp)) +
  geom_line() +
  geom_point() +
  theme_minimal()

Let’s change the settings, in other words, the visual presentation of the plot without adding any new information. In addition to the color and transparency, you can change the line width and the line type (dotted, dashed, …). With respect to the line type, Figure 10.6 shows the 6 most common types: solid, dashed, dotted, dotdash, longdash and twodash. To illustrate these settings, we’ll plot the data for Costa Rica using a blue dotted line with linewidth 1.5 and a level of transparency (alpha) equal to 0.50.

df_life_cribra |> filter(iso3c == "CRI") |>
  ggplot(aes(x = year, y = life_exp)) +
  geom_line(color = "blue", linewidth = 1.5, linetype = "dotted", alpha = 1/2) +
  theme_minimal()

Figure 10.6: Line shapes

For the geometry geom_step(), with the exception of the direction of the steps (hv, vh or mid), all other settings are similar. However, in this case, the result is not a continuous line, but a line that moves in discrete steps from one point to the other. To illustrate the latter, a point geometry shows the exact location of each observation.

df_life_cribra |> filter(iso3c == "CRI") |>
  ggplot(aes(x = year, y = life_exp)) +
  geom_step(color = "blue", direction = "vh") +
  geom_point(color = "red") +
  theme_minimal()

This plot also illustrates the direction. With vh chosen here, the movement from one point to the other starts with the vertical movement. As the line reaches the y-level of the next point, the line graph moves in horizontal direction. With ‘hv’, the first movement would be horizontal. Using mid, the vertical movement starts in the middle of the horizontal movement.
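
To compare the three directions side by side, you can generate the same step plot for each value of direction. The sketch below assumes a small made-up series (toy) and a hypothetical helper step_plot():

```r
library(ggplot2)

# Hypothetical toy series
toy <- data.frame(year = 2000:2004, value = c(1, 3, 2, 5, 4))

# Hypothetical helper: the same step plot for a given direction
step_plot <- function(direction) {
  ggplot(toy, aes(x = year, y = value)) +
    geom_step(direction = direction, color = "blue") +
    geom_point(color = "red") +
    labs(title = paste("direction =", direction)) +
    theme_minimal()
}

# One plot per direction, stored in a list
plots <- lapply(c("hv", "vh", "mid"), step_plot)
```

Printing plots[[1]], plots[[2]] and plots[[3]] shows how the staircase changes while the red dots stay where they are.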

For both geom_line() and geom_step(), and in addition to the required x and y mapping, you can map additional variables on the color, line type and line width aesthetics. You can do so in the ggplot() function as well as in the geom_line() or geom_step() functions. The difference between both was discussed in the section on the point geometry. Here, we’ll use the life expectancy dataset with both Brazil and Costa Rica and map these countries on the color aesthetic:

df_life_cribra |>
  ggplot(aes(x = year, y = life_exp)) +
  geom_step(aes(color = country)) +
  theme_minimal()

the linetype aesthetic:

df_life_cribra |>
  ggplot(aes(x = year, y = life_exp)) +
  geom_step(aes(linetype = country)) +
  theme_minimal()

and the linewidth aesthetic:

df_life_cribra |>
  ggplot(aes(x = year, y = life_exp)) +
  geom_step(aes(linewidth = country)) +
  theme_minimal()
Warning: Using linewidth for a discrete variable is not advised.

The results would be similar for the geom_line() geometry. In Chapter 11, we see how you can change the colors, line width and type, … in the mapping. There, you’ll see how you can change e.g. the values used to set the line width or determine the color of the line.

The group aesthetic is useful if you want to show the evolution for a large number of values. Suppose that you want to show the evolution of life expectancy for all countries in the dataset. As the number of countries is very large, mapping the country variable on an aesthetic such as color, linewidth or linetype is usually not a good option: with that many countries, there are way too many colors to allow for a meaningful difference across countries. This is where the group aesthetic can help you out. To see what it does, in the next figure, we’ll map the variable country on the group aesthetic while we keep date on the x-axis and life expectancy on the y-axis:

life_df |>
  ggplot(aes(x = date, y = life_exp)) +
  geom_line(aes(group = country)) +
  theme_minimal()
Warning: Removed 534 rows containing missing values or values outside the scale range
(`geom_line()`).

The result is a line graph with all lines shown in the same (default) color, type and width. There is one line per country. You can combine the group aesthetic with e.g. a color aesthetic. Using the previous graph, you could map, e.g. the region variable on the color aesthetic. In that way, all countries in the same region would be shown using the same color, with different colors per region.

life_df |>
  ggplot(aes(x = date, y = life_exp)) +
  geom_line(aes(color = region, group = country)) +
  theme_minimal()
Warning: Removed 534 rows containing missing values or values outside the scale range
(`geom_line()`).
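
To check how group and color work together, you can count the lines and the colors in the layer data. A minimal sketch, assuming a small made-up panel (toy) with three countries in two regions:

```r
library(ggplot2)

# Hypothetical panel: three countries in two regions, three years each
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 3),
  region  = rep(c("North", "North", "South"), each = 3),
  year    = rep(2000:2002, times = 3),
  value   = c(60, 61, 62, 70, 71, 73, 55, 57, 58)
)

p_group <- ggplot(toy, aes(x = year, y = value)) +
  geom_line(aes(group = country, color = region))

built <- layer_data(p_group)
n_lines   <- length(unique(built$group))    # one line per country
n_colours <- length(unique(built$colour))   # one colour per region
```

The group aesthetic determines how many lines are drawn (three), while the color aesthetic only determines how they look (two colors).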

Using {gghighlight} (https://yutannihilation.github.io/gghighlight/index.html) you can highlight some countries. For ways to do so, see the documentation (Yutani (2024)).

If series are very volatile, adding geom_smooth() allows you to filter out noise. To illustrate, we’ll add a smooth function for both Brazil and Costa Rica. We’ll accept all default values. Using the linetype or linewidth options in the geom_smooth() geometry, you could change the linetype of e.g. the smoothed line and make it a bit thinner than the line showing the actual data (or vice versa).

df_life_cribra |>
  ggplot(aes(x = year, y = life_exp, color = country)) +
  geom_line() +
  geom_smooth() +
  theme_minimal()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

10.4.0.2.2 geom_path()

The previous geometries connect the observations in the order of the variable mapped on the x-axis. This is not always the most useful representation of the data. If you want to show co-movement for instance, geom_line() or geom_step() are not the most suitable options. To illustrate, consider the dataset in data_beveridge.csv. The dataset includes quarterly data for the unemployment rate (the number of unemployed to the number of employed + the number of unemployed) and the vacancy rate (the number of vacancies to the number of jobs in a country). The Beveridge curve shows how both move together: when the economy is solid and growth is good, the unemployment rate should be low and the vacancy rate high. The opposite should be the case in times of recession. The dataset is included in data > raw > data_beveridge.csv. To import the data, you can run

data_beveridge <- read_csv(here::here("data", "raw", "data_beveridge.csv")) 
Rows: 76 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): TIME PERIOD, FREQUENCY
dbl  (2): vacancy_rate, unemployment_rate
date (1): DATE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_beveridge <- data_beveridge |> rename(date = DATE, quarter = `TIME PERIOD`)
data_beveridge |> slice_sample(n = 5)
# A tibble: 5 × 5
  date       quarter vacancy_rate unemployment_rate FREQUENCY
  <date>     <chr>          <dbl>             <dbl> <chr>    
1 2023-06-30 Q2 2023          3.1              6.49 Q        
2 2006-03-31 Q1 2006          1.1              8.90 Q        
3 2010-12-31 Q4 2010          1.4             10.2  Q        
4 2009-03-31 Q1 2009          1.1              9.05 Q        
5 2012-09-30 Q3 2012          1.4             11.6  Q        

The data include a variable date showing the last day of the quarter, a variable quarter showing the quarter, and the vacancy_rate and unemployment_rate. One way to show co-movement is to add both to the same line graph. To do so, we add two line geometries: one for the unemployment rate and one for the vacancy rate.

data_beveridge |> ggplot(aes(x = date)) +
  geom_line(aes(y = unemployment_rate), color = "red") +
  geom_line(aes(y = vacancy_rate), color = "blue") +
  theme_minimal()

However, as you can see from the graph, this is not very helpful. Even if both series are measured in percent and both are usually less than 10 percentage points removed from one another (i.e. they are close enough to compare), it is difficult to spot co-movement. The vacancy rate’s volatility is much higher than the volatility of the unemployment rate.

geom_path() allows you to map one series on the horizontal axis and another on the vertical axis. Doing so, you start from a point geometry with one dot for every (x-value, y-value) pair. A line then connects all points in the order in which they appear in the dataset. To see how the graph builds, let’s start from the point geometry with the unemployment rate on the horizontal axis and the vacancy rate on the vertical axis:

data_beveridge |> ggplot(aes(x = unemployment_rate, y = vacancy_rate)) +
  geom_point() +
  theme_minimal()

The result is a scatterplot with one dot for every observation in the dataset. We don’t know, however, how both variables moved together. Starting from any dot in the scatterplot, there is no way to tell which dot came in the quarter after that dot. As a matter of fact, there is no way to identify the first dot in the data. This is where geom_path() makes the difference: it connects dots as they appear in the data. In other words, it shows which “dot” came first, which dot came after the first, which one came after the second, … . In short, it shows the path from one point to the other: if you start from the start of the line, you can follow the path until it reaches the end of the line. Here, both the geom_path() and the point geometry are shown. However, this does not need to be the case and you can use the path geometry as a stand-alone layer.

data_beveridge |> ggplot(aes(x = unemployment_rate, y = vacancy_rate)) +
  geom_path() +
  geom_point() +
  theme_minimal()
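
To make the difference with geom_line() concrete, here is a minimal sketch on a small made-up data frame (toy is hypothetical and not part of the chapter's data). The rows are deliberately not sorted by x. layer_data() returns the data as a layer will draw them: geom_path() should keep the row order, whereas geom_line() should sort the rows along the x-axis first.

```r
library(ggplot2)

# Hypothetical toy data: rows are in "time" order, not sorted by x
toy <- data.frame(
  x = c(3, 1, 4, 2),
  y = c(1, 2, 3, 4)
)

p_path <- ggplot(toy, aes(x = x, y = y)) + geom_path() + geom_point()
p_line <- ggplot(toy, aes(x = x, y = y)) + geom_line() + geom_point()

# x-values in the order in which the first layer connects them
path_order <- layer_data(p_path, 1)$x  # row order of the data
line_order <- layer_data(p_line, 1)$x  # ordered along the x-axis

path_order
line_order
```

If your data are not stored in chronological order, sort them (e.g. by date) before you use geom_path(); otherwise the path will not show the movement over time.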

If you add a geom_text() or geom_label() layer, you can add further information to the graph. To do so, you need to map a variable on the label aesthetic. In this case, we could use the variable quarter, as it shows in which quarter the unemployment rate and vacancy rate were measured. With this label, you see the movements per quarter. Fitting the label using geom_text() or geom_label() often requires some experimenting with the settings, e.g. the size of the font, the nudge left, right, up or down (using the arguments nudge_x or nudge_y) or the color of the font:

data_beveridge |> ggplot(aes(x = unemployment_rate, y = vacancy_rate)) +
  geom_path() +
  geom_text(aes(label = quarter), size = 3, color = "grey", nudge_x = 0.2, check_overlap = TRUE) +
  geom_point() +
  theme_minimal()

The label is shown in grey, with check_overlap = TRUE and a nudge to the right to avoid too much overlap with the line. In addition to the options that are similar to those of the other line geometries, geom_path() includes arguments that allow you to determine how the line ends (lineend =), how the various parts connect (linejoin =) and, e.g., whether to add an arrow (arrow =).

geom_path(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  ...,
  lineend = "butt",
  linejoin = "round",
  linemitre = 10,
  arrow = NULL,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)

To add an arrow, replace arrow = NULL with a call to arrow(). Note that the argument expects an arrow specification created by arrow(), not TRUE. Using arrow = arrow() adds an arrow with the default options. You can change these defaults inside arrow(): the angle of the arrow head (the width of the head), the length of the arrow head, set with unit() in e.g. “inches” or “mm”, whether arrows are needed at the end of the line (“last”), at the beginning (“first”) or at both ends (“both”), and whether you would like the arrow head to be open or closed.

arrow(angle = 30, length = unit(0.25, "inches"),
      ends = "last", type = "open")

You can now add an arrow at the end. Here, we draw an arrow with a closed arrow head with length 6 mm and angle 22.5°:

data_beveridge |> ggplot(aes(x = unemployment_rate, y = vacancy_rate)) +
  geom_path(
    arrow = arrow(angle = 22.5, length = unit(6, "mm"), ends = "last", type = "closed")) +
  geom_text(aes(label = quarter), size = 3, color = "grey", nudge_x = 0.2, check_overlap = TRUE) +
  geom_point() +
  theme_minimal()

Note that the {ggrepel} package includes a geometry, geom_text_repel(), that allows you to add, e.g., line segments connecting the text and the dot. If you create a graph using geom_path() and you want to add labels, it is worthwhile to look at the options in this package. Using geom_text_repel() you can fine-tune most of the label or text parts (in addition to Slowikowski (2024) you can also use “?ggrepel::geom_text_repel” in the console to check all the options). To show only a limited set of the possibilities of this package, here is a graph using the geom_text_repel() geometry:

library(ggrepel)

data_beveridge |> ggplot(aes(x = unemployment_rate, y = vacancy_rate, label = quarter)) +
  geom_path() +
  geom_point(color = "red") +
  geom_text_repel(
    size = 3, 
    color = "#003049", 
    min.segment.length = 0, 
    seed = 42, 
    box.padding = 0.5, 
    max.overlaps = getOption("ggrepel.max.overlaps", default = 20)) +
  theme_minimal()

10.4.0.2.2 geom_hline(), geom_vline() and geom_abline()

These geometries allow you to add horizontal lines, vertical lines and lines defined by an intercept and a slope to a plot. Doing so, you can e.g. mark moments in time or show the average value of the variable on the y-axis over a longer period of time. The arguments for these three functions are very similar. As an example, for geom_hline() these arguments are

geom_hline(
  mapping = NULL,
  data = NULL,
  ...,
  yintercept,
  na.rm = FALSE,
  show.legend = NA
)

In addition to the usual arguments, this function includes the argument yintercept: the value where the horizontal line crosses the vertical axis. In geom_vline(), the argument is xintercept: the value where the vertical line intersects the horizontal axis. For geom_abline(), you need the intercept, i.e. where the line intersects the vertical axis if the value on the horizontal axis is 0, and the slope, i.e. the increase per unit on the horizontal axis. To illustrate the use of these geometries, let’s use the unemployment rate from the data_beveridge dataset and show this variable using a line geometry:

data_beveridge |>
  ggplot(aes(x = date, y = unemployment_rate)) +
  geom_line(color = "grey", linewidth = 0.75) +
  theme_minimal()

Suppose that you want to add the average for the entire period. The average value equals

ave_unemployment <- mean(data_beveridge$unemployment_rate)

We can add this to the plot using geom_hline().

data_beveridge |>
  ggplot(aes(x = date, y = unemployment_rate)) +
  geom_line(color = "grey", linewidth = 0.75) +
  geom_hline(yintercept = ave_unemployment, color = "lightblue", linewidth = 0.75) +
  theme_minimal()

Using geom_vline() you can also highlight the quarter where the major investment bank Lehman Brothers collapsed (third quarter 2008):

data_beveridge |>
  ggplot(aes(x = date, y = unemployment_rate)) +
  geom_line(color = "grey", linewidth = 0.75) +
  geom_vline(xintercept = as.Date("2008-09-30"), color = "orange", linewidth = 0.75) +
  theme_minimal()

Using geom_abline() you can draw any straight line. Suppose that you have a scatter plot looking like Figure 10.7

Figure 10.7: Using geom_abline() in a scatter plot

and would like to draw a red line with intercept 0 and slope equal to 10. Using geom_abline() as an additional layer, you can add this line to the scatter plot:

df |> ggplot() +
  geom_point(aes(x = x, y = z1)) +
  geom_abline(intercept = 0, slope = 10, color = "red", linewidth = 0.75) +
  theme_minimal()
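
The data frame df is not created in this section. As a minimal sketch, the code below builds a stand-in (df_toy is hypothetical, with made-up values scattered around z1 = 10x) so that you can run the example on its own. It also adds a second geom_abline() layer whose intercept and slope come from a least squares fit with lm(), to show that both arguments can be computed rather than typed in.

```r
library(ggplot2)

set.seed(123)  # make the made-up data reproducible

# Hypothetical stand-in for df: z1 is roughly 10 times x, plus noise
df_toy <- data.frame(x = seq(0, 10, by = 0.5))
df_toy$z1 <- 10 * df_toy$x + rnorm(nrow(df_toy), mean = 0, sd = 5)

# Intercept and slope of a fitted line, as an alternative to fixed values
fit <- coef(lm(z1 ~ x, data = df_toy))

p_abline <- ggplot(df_toy) +
  geom_point(aes(x = x, y = z1)) +
  # The line from the text: intercept 0, slope 10
  geom_abline(intercept = 0, slope = 10, color = "red", linewidth = 0.75) +
  # The fitted line, dashed
  geom_abline(intercept = fit[["(Intercept)"]], slope = fit[["x"]],
              color = "blue", linetype = "dashed") +
  theme_minimal()

p_abline
```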

All these lines were drawn across the entire plot panel. Using geom_segment() and geom_curve() you can draw straight lines and curves that are shorter.

10.4.0.2.3 geom_segment() and geom_curve()

With geom_segment() and geom_curve() you can connect two points in a graph. To illustrate both, we’ll use the life expectancy dataset and filter observations for 1980 and 2020. Using both geom_segment() and geom_curve() we will connect two points. For every country, the first point is the (per capita gdp, life expectancy) pair in 1980. The second point for each country consists of the values for the same variables in 2020. The required aesthetic mappings for both geometries are x, the first observation for the variable mapped on the horizontal axis, y, the first observation for the variable mapped on the vertical axis, and xend and yend, the last observations of the variables mapped on the horizontal and vertical axis. For the life expectancy dataset, x = gdp per capita in 1980, y = life expectancy in 1980, xend = gdp per capita in 2020 and yend = life expectancy in 2020. Other aesthetic mappings include the color of the line, its width and type and, for geom_curve(), also the curvature.

Let’s first prepare the dataset: filter the observations for 1980 and 2020 from life_df. In addition, we need a variable for gdp in 1980 and one for gdp in 2020. The same holds for life expectancy. To create these variables, we need to pivot the dataset wider (Chapter 8):

life_seg <- life_df |> 
  filter(date == 1980 | date == 2020) |> 
  pivot_wider(
    names_from = c(date), 
    values_from = c(gdp_capita, life_exp, pop),
    names_glue = "{date}_{.value}")
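
To see what names_glue does to the column names, here is a minimal sketch on a miniature, made-up version of the data (life_toy and its values are hypothetical). The template "{date}_{.value}" puts the year first, followed by the name of the original column.

```r
library(tidyr)

# Hypothetical miniature version of life_df (values are made up)
life_toy <- data.frame(
  country    = c("A", "A", "B", "B"),
  date       = c(1980, 2020, 1980, 2020),
  gdp_capita = c(1000, 2500, 4000, 9000),
  life_exp   = c(60, 72, 68, 80)
)

life_toy_wide <- life_toy |>
  pivot_wider(
    names_from = date,
    values_from = c(gdp_capita, life_exp),
    names_glue = "{date}_{.value}")

# One row per country; the year now comes first in every new column name
names(life_toy_wide)

# Names that start with a digit are not syntactic: refer to them with backticks
gain <- life_toy_wide$`2020_life_exp` - life_toy_wide$`1980_life_exp`
gain
```

This is why the variable names in the plots that follow are wrapped in backticks.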

We can now use life_seg to illustrate geom_segment(). To show the data, we kept the log-transformation on the horizontal axis and added the aesthetic color to show lines in different colors per region:

life_seg |>
  ggplot() +
  geom_segment(
    aes(x = `1980_gdp_capita`, xend = `2020_gdp_capita`, y = `1980_life_exp`, yend = `2020_life_exp`, color = region)) +
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
  theme_minimal()
Warning: Removed 62 rows containing missing values or values outside the scale range
(`geom_segment()`).

Usually, to show direction, you would add arrows. You can do so using the arrow = argument in both geom_segment() and geom_curve(). Here, the arrow head is closed and is 2.5 mm long, with an angle equal to 11 degrees:

life_seg |>
  ggplot() +
  geom_segment(
    aes(x = `1980_gdp_capita`, xend = `2020_gdp_capita`, y = `1980_life_exp`, yend = `2020_life_exp`, color = region), 
    arrow = arrow(angle = 11, type = "closed", length = unit(2.5, "mm") )) +
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
  theme_minimal()
Warning: Removed 62 rows containing missing values or values outside the scale range
(`geom_segment()`).

For every country, you can now spot the evolution of both variables between 1980 and 2020. A line and arrow pointing to the top right suggest that both per capita gdp and life expectancy improved over that period; a line pointing to the top left suggests that per capita gdp fell (left) but life expectancy rose (top). If the line and arrow point down, life expectancy fell while per capita gdp rose (line pointing right) or fell (line pointing left).

Another way to use geom_segment() is to combine it with a point geometry and connect two points. Using the life expectancy data for 1980 and 2020, we can use a point geometry to show both values. We will do so for a sample of 40 countries and first remove observations with missing data for life expectancy in 1980. We map the countries on the vertical axis and sort them in descending order of life expectancy at birth in 1980. To do so, we use the fct_reorder() function from {forcats} (Chapter 4). We use geom_point() to show the observations. To do so, we map life expectancy at birth on the horizontal axis and also on the size and color aesthetics. Doing so, higher values for life expectancy will show up through a larger dot as well as through the color of the dot. We drop the legend. We need two point geometries: one for 1980 and one for 2020.

life_seg |> filter(!is.na(`1980_life_exp`)) |> slice_sample(n = 40) |>
  ggplot(aes(y = fct_reorder(country, `1980_life_exp`, .desc = TRUE, na.rm = TRUE))) +
  geom_point(
    aes(x = `1980_life_exp`, size = `1980_life_exp`, color = `1980_life_exp`), 
    show.legend = FALSE) +
  geom_point(
    aes(x = `2020_life_exp`, size = `2020_life_exp`, color = `2020_life_exp`), 
    show.legend = FALSE)

We now have two dots per country: one with life expectancy at birth in 1980 and one with life expectancy in 2020. Using geom_segment() we can now connect these dots. In the aesthetics, we map life expectancy in 1980 on the x aesthetic and life expectancy in 2020 on the xend aesthetic. We also map the latter on the color aesthetic. Doing so, the segment will have the same color as the dots for life expectancy in 2020. In the last line, we change the theme:

life_seg |> filter(!is.na(`1980_life_exp`)) |> slice_sample(n = 40) |>
  ggplot(aes(y = fct_reorder(country, `1980_life_exp`, .desc = TRUE, na.rm = TRUE))) +
  geom_point(
    aes(x = `1980_life_exp`, size = `1980_life_exp`, color = `1980_life_exp`), 
    show.legend = FALSE) +
  geom_point(
    aes(x = `2020_life_exp`, size = `2020_life_exp`, color = `2020_life_exp`), 
    show.legend = FALSE) +
  geom_segment(
    aes(x = `1980_life_exp`, xend = `2020_life_exp`, color = `2020_life_exp`), 
    arrow = arrow(angle = 12, type = "closed", length = unit(2, "mm")),
    show.legend = FALSE) +
  theme_minimal()

There is some further layout work, e.g. the choice of colors for the dots and lines, the labels of the axes, a caption, … . We could add the flag of the country next to its name, … . But here, you have the basic “dumbbell chart”: a combination of two point geometries and one segment.

geom_curve() shows a similar plot, but draws a curved line. You can control the curvature using the curvature = argument. A positive value produces a right-hand curve, a negative value a left-hand curve, and 0 a straight line. In the example, the curvature is -0.5. The plot also includes arrows.

life_seg |>
  ggplot() +
  geom_curve(
    aes(x = `1980_gdp_capita`, xend = `2020_gdp_capita`, y = `1980_life_exp`, yend = `2020_life_exp`, color = region), 
    arrow = arrow(angle = 11, type = "closed", length = unit(2.5, "mm")), 
    curvature = -0.50) +
  scale_x_continuous(
    transform = "log",
    breaks = c(100, 1000, 10000, 100000),
    labels = scales::label_currency(prefix = "$")) +
  theme_minimal()
Warning: Removed 62 rows containing missing values or values outside the scale range
(`geom_curve()`).

Getting the curvature right usually requires some experimenting with various numbers, both positive and negative, usually starting from the default value (0.5) and moving closer to 0 to reduce the curvature or further from 0 to increase it.

You can also use both geometries to add a single line or curve segment to a plot. To do so, you can use the aes() mapping: for any segment or curve you would like to add as a layer, you supply a value for x, xend, y and yend in the aesthetic mapping of geom_segment() or geom_curve(), and the geometry draws a line or curve between the points (x, y) and (xend, yend). Using the unemployment data in the Beveridge dataset, the next graph adds a curve stressing the rise in the unemployment rate and a segment stressing the subsequent fall. The values for x and xend are dates. For both the curve and the segment, the end is shown with an arrow:

data_beveridge |>
  ggplot(aes(x = date, y = unemployment_rate)) +
  geom_line(color = "blue", linewidth = 0.75) +
  geom_curve(
    aes(x = as.Date("2008-03-31"),
        y = 7.5, 
        xend = as.Date("2013-03-31"), 
        yend = 12.5), 
    curvature = -0.25, 
    color = "red", 
    arrow = arrow(angle = 22.5, type = "closed", length = unit(5, "mm"))) +
  geom_segment(
    aes(x = as.Date("2013-06-30"),
        y = 12.5, 
        xend = as.Date("2025-03-30"), 
        yend = 6.5), 
    color = "#2ECC71", 
    arrow = arrow(angle = 22.5, type = "closed", length = unit(5, "mm"))) +
  theme_minimal()
Warning in geom_curve(aes(x = as.Date("2008-03-31"), y = 7.5, xend = as.Date("2013-03-31"), : All aesthetics have length 1, but the data has 76 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
Warning in geom_segment(aes(x = as.Date("2013-06-30"), y = 12.5, xend = as.Date("2025-03-30"), : All aesthetics have length 1, but the data has 76 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

As you can see, R suggests using another function, annotate(), to draw these lines. We cover annotate() in Chapter 11.
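
As a preview, here is a minimal sketch of what the warning hints at. It uses a made-up quarterly series (toy_unemp and its values are hypothetical) instead of data_beveridge. With annotate(), the positions are plain arguments rather than aes() mappings, so the curve is drawn once and not once per row of the data.

```r
library(ggplot2)

# Hypothetical quarterly series (made-up values), standing in for data_beveridge
toy_unemp <- data.frame(
  date = seq(as.Date("2008-01-01"), by = "quarter", length.out = 20)
)
toy_unemp$unemployment_rate <- seq(7, 12, length.out = 20)

p_annot <- ggplot(toy_unemp, aes(x = date, y = unemployment_rate)) +
  geom_line(color = "blue", linewidth = 0.75) +
  # Fixed positions are passed directly, without aes()
  annotate(
    "curve",
    x = as.Date("2008-06-30"), y = 8.5,
    xend = as.Date("2012-06-30"), yend = 12.5,
    curvature = -0.25, color = "red",
    arrow = arrow(angle = 22.5, type = "closed", length = unit(5, "mm"))) +
  theme_minimal()

p_annot

# The annotation layer holds a single row, whatever the size of the data
nrow(layer_data(p_annot, 2))
```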

Note that you can use geom_hline(), geom_vline(), geom_abline(), geom_segment() and geom_curve() to draw graphs such as a supply and demand diagram.

To keep it as simple as possible and use as many line geoms as possible, let’s start from the y-axis, which we will call “Price”. We generate this series as a simple sequence from 0 to 100:

sup_dem <- data.frame(
  price = seq(0, 100, by = 1))

We can now add demand

\[ Q_D = 800 - 8P \]

and, using the inverse supply curve, supply

\[ P = \frac{1}{8} Q_S \Rightarrow Q_S = 8P \]

sup_dem$demand <- 800 - 8 * sup_dem$price
sup_dem$supply <- 8 * sup_dem$price

Using the data in sup_dem we can now build the supply and demand plot:

sup_dem |> ggplot(aes(y = price)) + # Price on the vertical axis
  # Show quantity on the horizontal axis, add some color and linetype and width (as a setting)
  geom_line(aes(x = demand), color = "grey70", linewidth = 0.75, linetype = "solid") +
  geom_line(aes(x = supply),  color = "grey70", linewidth = 0.75, linetype = "solid") +
  
  # Dotted lines for the equilibrium (note that you can calculate the equilibrium 
  # using the demand and supply parameters and add them here)
  geom_segment(aes(x = 0, y = 50, xend = 400, yend = 50), linetype = "dotted", color = "black") +
  geom_segment(aes(x = 400, y = 0, xend = 400, yend = 50), linetype = "dotted", color = "black") +
  
  # Create x and y-axis 
  geom_hline(yintercept = 0, linewidth = 1, color = "darkgrey") +
  geom_vline(xintercept = 0, linewidth = 1, color = "darkgrey") +
  
  # Set the labels for the axis
  labs(
    x = "Quantity", 
    y = "Price") +
  
  # Add some "text" annotations at position x = 700 and y = 95
  # here the text is "Supply"
  annotate("text", x = 700, y = 95, label = "Supply") +
  
  # Add some "text" annotations at position x = 700 and y = 5
  # here the text is "Demand"
  annotate("text", x = 700, y = 5, label = "Demand") +
  
  # Remove all ticks and labels for the x and y axis
  theme(axis.text.x = element_blank(),
      axis.ticks.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      
  # Show on a white background
      panel.background = element_rect(fill = "white"))
Warning in geom_segment(aes(x = 0, y = 50, xend = 400, yend = 50), linetype = "dotted", : All aesthetics have length 1, but the data has 101 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
Warning in geom_segment(aes(x = 400, y = 0, xend = 400, yend = 50), linetype = "dotted", : All aesthetics have length 1, but the data has 101 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
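
The comment in the code notes that you can calculate the equilibrium from the demand and supply parameters. A minimal sketch of that calculation follows; the names a, b, d, p_star and q_star are only chosen for illustration. Setting demand equal to supply, 800 - 8P = 8P, gives the equilibrium price, and substituting it in the supply equation gives the quantity. The dotted lines are drawn with annotate(), as the warnings above suggest.

```r
library(ggplot2)

# Parameters of the two curves (hypothetical names):
# demand: Q_D = a - b * P, supply: Q_S = d * P
a <- 800
b <- 8
d <- 8

# In equilibrium Q_D = Q_S, so a - b * P = d * P, which gives P = a / (b + d)
p_star <- a / (b + d)
q_star <- d * p_star

sup_dem_eq <- data.frame(price = seq(0, 100, by = 1))
sup_dem_eq$demand <- a - b * sup_dem_eq$price
sup_dem_eq$supply <- d * sup_dem_eq$price

p_eq <- ggplot(sup_dem_eq, aes(y = price)) +
  geom_line(aes(x = demand), color = "grey70", linewidth = 0.75) +
  geom_line(aes(x = supply), color = "grey70", linewidth = 0.75) +
  # Dotted lines from the axes to the computed equilibrium
  annotate("segment", x = 0, y = p_star, xend = q_star, yend = p_star,
           linetype = "dotted") +
  annotate("segment", x = q_star, y = 0, xend = q_star, yend = p_star,
           linetype = "dotted") +
  labs(x = "Quantity", y = "Price") +
  theme_minimal()

p_eq
```

With these parameters, p_star equals 50 and q_star equals 400, the values that were typed in by hand in the plot above. If you change a, b or d, the dotted lines move with the curves.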

10.4.0.3 Area geometries

There are two area geometries that are often used: geom_area() and geom_ribbon().

10.4.0.3.1 geom_area()

geom_area() allows you to show how the composition of a variable changes over time, e.g. the share of various components in GDP, the share of items in household expenditures, the share of one product or region in total sales … . To illustrate this geometry, we’ll use the life expectancy at birth dataset for Brazil and Costa Rica and the AMECO database with data on the age structure of the population in France: the population aged 0-14, aged 15-64 and aged 65 and over.

For the first dataset, if you haven’t used it yet, you can run the following code

life_df <- read_csv(here::here("data", "raw", "life_df.csv"))
Rows: 13671 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): iso2c, iso3c, country, region
dbl (4): date, gdp_capita, life_exp, pop

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_life_cribra <- filter(life_df, iso3c == "CRI" | iso3c == "BRA") |> rename(year = date)

In Chapter 7, you used the AMECO dataset to learn some tidying skills. In the next box, I’ll create the dataset from the data in the data > raw > ameco_xlsx folder. We’ll need AMECO1.XLSX.

First import the dataset

library(readxl)

ameco1 <- read_excel(
  path = here::here("data", "raw", "ameco_xlsx", "AMECO1.XLSX"), 
  sheet = NULL,
  range = NULL,
  col_names = TRUE,
  na = c(" ", "NA")
)
New names:
• `UNIT` -> `UNIT...5`
• `UNIT` -> `UNIT...11`

We’ll store two data frames: one with the data (ameco1) and one with the data labels (ameco2):

ameco2 <- ameco1 |> select(CODE, TITLE, UNIT...11)
ameco1 <- ameco1 |> select(CODE, COUNTRY, matches("\\d{4}"))

To tidy the dataset, we need to pivot: first longer, followed by a pivot wider:

ameco1 <- ameco1 |> pivot_longer(cols = matches("[0-9]{4}"), names_to = "year", values_to = "value")
ameco1 <- ameco1 |> pivot_wider(names_from = "CODE", values_from = "value")

After selecting the series we need, we need to pivot the data back to the longer format: the population structure will be used as one variable with three levels, one for each age category:

# Select the variables we need
ameco1 <- ameco1 |> select(COUNTRY, year, NPTN, NPCN, NPAN, NPON)

# year is a character variable: convert it to numeric
ameco1$year <- as.numeric(ameco1$year)

# Pivot longer to create one variable, "age_group", with one value per selected
# series (the total population and the three age categories)
ameco1 <- ameco1 |> pivot_longer(cols = starts_with("NP"), names_to = "age_group", values_to = "pop")

To get nice labels, we first identify the unique labels and merge those with the data. This will allow us to generate e.g. nice legends:

ameco2 <- ameco2 |> filter(CODE == "NPTN" | CODE == "NPCN" | CODE == "NPAN" | CODE == "NPON")
ameco2 <- unique(ameco2)
ameco1 <- ameco1 |> left_join(ameco2, by = join_by(age_group == CODE))

Tidy the names and select France for all years to 2022:

ameco1 <- ameco1 |> rename(unit = UNIT...11, country = COUNTRY, population = TITLE)
pop_fra <- ameco1 |> filter(
  country == "France" & (age_group == "NPCN" | age_group == "NPAN" | age_group == "NPON") & year <= 2022)

Let’s check the data:

pop_fra |> slice_sample(n = 5)
# A tibble: 5 × 6
  country  year age_group    pop population                    unit        
  <chr>   <dbl> <chr>      <dbl> <chr>                         <chr>       
1 France   1989 NPCN      11823. Population: 0 to 14 years     1000 persons
2 France   1996 NPCN      11649. Population: 0 to 14 years     1000 persons
3 France   2007 NPON      10434. Population: 65 years and over 1000 persons
4 France   1968 NPCN      13004. Population: 0 to 14 years     1000 persons
5 France   1963 NPCN      12875. Population: 0 to 14 years     1000 persons

Let’s first use an area graph with one variable. We will map the variable year on the horizontal axis and life expectancy at birth for Brazil on the vertical one. These two aesthetics are required. We will then use geom_area() and accept all default values:

df_life_cribra |> filter(iso3c == "BRA") |>
  ggplot(aes(x = year, y = life_exp)) + 
  geom_area() +
  theme_minimal()

With one variable, geom_area() essentially creates a line geometry but fills the area between the line, the horizontal axis and the vertical axis. Using the settings, you can change the color or transparency and using geom_line() you can show the line as an additional layer. For instance,

df_life_cribra |> filter(iso3c == "BRA") |>
  ggplot(aes(x = year, y = life_exp)) + 
  geom_area(fill = "blue", alpha = 0.20) +
  geom_line(color = "blue", linewidth = 0.75) +
  theme_minimal()

In the dataset, we have two countries, Brazil and Costa Rica. If you map the variable country on the fill aesthetic, ggplot() will stack the life expectancy data. In other words, it adds life expectancy in Brazil to life expectancy in Costa Rica and shows the stacked outcome.

df_life_cribra |>
  ggplot(aes(x = year, y = life_exp, fill = country)) + 
  geom_area() +
  theme_minimal()

This outcome, which doesn’t make sense, is due to the default values in the geom_area() function:

geom_area(
  mapping = NULL,
  data = NULL,
  stat = "align",
  position = "stack",
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  ...,
  outline.type = "upper"
)

The arguments mapping, data, na.rm, orientation, show.legend, inherit.aes and ... should be familiar. We already met stat and position, but for the first we always kept "identity" and for the second we kept "identity" or "jitter". The first, stat = "align", determines what happens if two series are stacked but have no common x-coordinates. By default, R interpolates the missing values so that the series are aligned. If there are two or more series, R stacks the values: it adds the values of the first series to those of the second to show the second series, adds the sum of the first two to the values of the third to show the third series, … . This is the default: position = "stack". Using position = "identity" changes this default. To illustrate, in the graph with life expectancy for Brazil and Costa Rica, geom_area() first plotted life expectancy for Costa Rica. The second series was the sum of life expectancy in Costa Rica and Brazil. Had there been a third country, the third line would have shown the sum of life expectancy in all three countries. If you change these defaults, R plots two separate areas. However, as life expectancy in Costa Rica was always higher than in Brazil, the area for Costa Rica overlaps and masks the area for Brazil.

df_life_cribra |>
  ggplot(aes(x = year, y = life_exp, fill = country)) + 
  geom_area(stat = "identity", position = "identity") +
  theme_minimal()
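
To see the difference between the two positions in numbers, here is a minimal sketch with made-up data (toy_area is hypothetical). Both plots use stat = "identity", so that only the position differs. layer_data() returns the values as they are drawn, including the upper edge ymax of each area.

```r
library(ggplot2)

# Hypothetical toy data: two series observed at the same three x-values
toy_area <- data.frame(
  x     = rep(1:3, times = 2),
  y     = c(1, 2, 3, 4, 5, 6),
  group = rep(c("a", "b"), each = 3)
)

p_stack <- ggplot(toy_area, aes(x = x, y = y, fill = group)) +
  geom_area(stat = "identity", position = "stack")
p_identity <- ggplot(toy_area, aes(x = x, y = y, fill = group)) +
  geom_area(stat = "identity", position = "identity")

# Highest upper edge of the areas as they are drawn
top_stack    <- max(layer_data(p_stack, 1)$ymax)     # the two series summed
top_identity <- max(layer_data(p_identity, 1)$ymax)  # the larger of the two series

c(stack = top_stack, identity = top_identity)
```

At x = 3 the two series equal 3 and 6: stacked, the top of the plot is their sum; with position = "identity" it is just the larger value.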

To deal with these issues, you can add transparency using the alpha setting and add a line per country, change the order in which they appear, … .

df_life_cribra |>
  ggplot(aes(x = year, y = life_exp, fill = country)) + 
  geom_area(stat = "identity", position = "identity", alpha = 0.10) +
  geom_line(aes(color = country)) +
  theme_minimal()

Using geom_area() with multiple variables, stacking them should make sense. Else there is little reason why you wouldn’t use geom_line(). This is the case for the French population dataset. In pop_fra we have a dataset which includes the number of people aged 0-14, 15-64 and 65 and older. The sum of these three population groups equals the total population in France. geom_area() plots these three series and stacks their values. Doing so, it creates three new series: one with the number of people aged 0-14, a second with the number of people aged 0-64 (the sum of those aged 0-14 and those aged 15-64) and a third series with the sum across all age categories. To show these series, it fills the area between the x-axis and the first series, between the first and second series and between the second and last series with a color. Doing so, the size of the area shows the size of each subgroup.

pop_fra |> ggplot(aes(x = year, y = pop, fill = age_group)) +
  geom_area() + 
  theme_minimal()

Here, the graph makes sense: it shows the evolution of the total population in France as the sum of three components: those aged 0-14, 15-64 and 65 and over. The size of each cohort is shown on the vertical axis. However, there is one issue. R orders the categories as NPAN (15-64), then the young (or children), NPCN, and then the “old”, NPON. This is the default: age_group is a character variable, so R sorts its values alphabetically. This order doesn’t make a lot of sense, as the categories imply an order of their own: the graph should preferably show the young at the bottom, those aged 15-64 in the middle and those aged 65 and over at the top. There are a couple of ways to deal with this. The first is to turn age_group into an ordered factor with the levels in the order you need. Here, we will first order the ameco1 dataset and then filter the data for France. Doing so, you can create an area graph for other countries by changing the filter:

ameco1$age_group <- factor(ameco1$age_group, 
                           levels = c("NPTN", "NPON", "NPAN", "NPCN"), 
                           ordered = TRUE)

pop_fra <- ameco1 |> filter(
  country == "France" & (age_group == "NPCN" | age_group == "NPAN" | age_group == "NPON") & year <= 2022)
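
The effect of the levels argument is easy to check on a small vector. The sketch below uses a made-up vector of codes (codes and age_ordered are hypothetical names). As far as I know, the default stacking puts the first level at the top and the last level at the bottom, which is why NPCN (the young) comes last in the levels above.

```r
# Hypothetical vector with the three age-group codes
codes <- c("NPAN", "NPCN", "NPON", "NPCN")

# Without explicit levels, the levels are sorted alphabetically
default_levels <- levels(factor(codes))
default_levels

# With explicit levels, the order is the one you supply
age_ordered <- factor(codes,
                      levels = c("NPON", "NPAN", "NPCN"),
                      ordered = TRUE)
levels(age_ordered)

# An ordered factor also supports comparisons between levels:
# is the first element (NPAN) ranked after the third element (NPON)?
age_ordered[1] > age_ordered[3]
```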

Let’s now see the result of the plot:

pop_fra |> ggplot(aes(x = year, y = pop, fill = age_group)) +
  geom_area()+ 
  theme_minimal()

Here you see that R ordered the areas in line with the age categories. It also uses different colors. This is due to the fact that R chooses another color scale for ordered factors as opposed to unordered factors. The yellow at the bottom is lighter than the dark blue at the top: in other words, the color implies an order: lighter colors for smaller values (in this case ordered per age group).

There are other ways to change the order. To see how they work, let’s create a geom_area() but now map the variable population on the fill aesthetic:

pop_fra |> ggplot(aes(x = year, y = pop, fill = population)) +
  geom_area() +
  theme_minimal()

From the legend, you can see that the plot shows the order from young to old. This is due to the fact that the alphabetical order of the categories in population happens to put the category with the youngest first and the one with the oldest last. If you want to reverse the order, you can add the option position = position_stack(reverse = TRUE):

pop_fra |> ggplot(aes(x = year, y = pop, fill = population)) +
  geom_area(position = position_stack(reverse = TRUE)) +
  theme_minimal()

The problem is not fully solved: the plot is fine, but the legend isn’t. Here, you can add scale_fill_discrete() and set the order of the legend entries manually with breaks (see Chapter 11):

pop_fra |> ggplot(aes(x = year, y = pop, fill = population)) +
  geom_area(position = position_stack(reverse = TRUE)) +
  scale_fill_discrete(breaks=c("Population: 65 years and over", "Population: 15 to 64 years", "Population: 0 to 14 years")) +
  theme_minimal()

When we used geom_line() we kept the default position = "identity". Given how closely geom_area() and geom_line() are connected, it shouldn’t come as a complete surprise that you can add position = "stack" as an argument in geom_line(). Doing so shows the stacked lines:

pop_fra |> ggplot(aes(x = year, y = pop, color = age_group)) +
  geom_line(position = "stack", linewidth = 1) + 
  theme_minimal()

Note that this graph is a bit misleading: many readers will interpret it to mean that the largest population group in France is the oldest and that there are hardly any young people, as they will not immediately notice that this is a stacked line chart, not one with position = "identity". This is the main reason why you should use an area chart: because it fills the areas, it also suggests that the categories are stacked. Note that you can also combine geom_area() and geom_line() with stacked lines. Doing so allows you to add a line between the areas in the area chart. If you add some transparency to the areas, these lines will stand out:

pop_fra |> ggplot(aes(x = year, y = pop)) +
  geom_area(aes(fill = age_group), alpha = 0.20) +
  geom_line(aes(color = age_group), position = "stack", linewidth = 1) + 
  theme_minimal()

10.4.0.3.2 geom_ribbon()

geom_ribbon() generalizes geom_area(). It allows you to plot a “ribbon” defined by minimum and maximum values for the variable mapped on the vertical axis. The required aesthetics for this geometry are x and the ymin and ymax aesthetics. The latter two define the boundaries of the ribbon. To illustrate, we’ll use the diamonds dataset, filter diamonds with carat <= 2.5, calculate the standard deviation and mean price per level of carat and show the result using a ribbon where the maximum value is defined as the mean price plus 1.96 times the standard deviation and the minimum value is defined as the mean price minus 1.96 times the standard deviation. Let’s first create the summary data frame:

diamonds_sum <- diamonds |>
  filter(carat <= 2.5) |>
  group_by(carat) |> 
  summarize(
    min_price = min(price), 
    max_price = max(price), 
    mean_price = mean(price, na.rm = TRUE), 
    sd_price = sd(price, na.rm = TRUE))

Using geom_ribbon() and accepting all default values for the settings:

diamonds_sum |> 
  ggplot(aes(
    x = carat, 
    ymin = mean_price - 1.96 * sd_price, 
    ymax = mean_price + 1.96 * sd_price)) +
  geom_ribbon() +
  theme_minimal()
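
One detail of the summary is worth checking. sd() returns NA for a group with a single observation, and a ribbon has no bounds to draw for such a row. The sketch below counts the observations per carat value and shows one possible way to deal with it: keep only groups with at least two observations. The names diamonds_sum_n and diamonds_sum_ok are hypothetical, and whether single-observation groups actually occur is something the code checks rather than assumes.

```r
library(ggplot2)  # for the diamonds dataset
library(dplyr)

# sd() is not defined for a single observation and returns NA
sd_single <- sd(5)
sd_single

diamonds_sum_n <- diamonds |>
  filter(carat <= 2.5) |>
  group_by(carat) |>
  summarize(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price))

# Carat values that are observed only once (if any) have no standard deviation
diamonds_sum_n |> filter(n == 1)

# One option: keep only the groups with at least two observations
diamonds_sum_ok <- diamonds_sum_n |> filter(n > 1)
```

Dropping these groups is a judgment call: the ribbon then spans the gap between neighbouring carat values instead of being interrupted.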

This ribbon shows the carat on the horizontal axis. The vertical axis shows two variables: the mean price plus 1.96 times the standard deviation of the price (per carat group) and the mean price minus 1.96 times the standard deviation (per carat group). Using e.g. fill you can change the color of the ribbon, using alpha the transparency of the ribbon and using the setting color you can change the color of the lines that show the minimum and maximum values. In addition, you can set the line width and line type. For instance:

diamonds_sum |> 
  ggplot(aes(
    x = carat, 
    ymin = mean_price - 1.96 * sd_price, 
    ymax = mean_price + 1.96 * sd_price)) +
  geom_ribbon(fill = "lightblue", color = "blue", linewidth = 0.50, linetype = "solid", alpha = 0.20) +
  theme_minimal()

If you also want to show the mean values, you add a line geometry:

diamonds_sum |> 
  ggplot(aes(
    x = carat, 
    ymin = mean_price - 1.96 * sd_price, 
    ymax = mean_price + 1.96 * sd_price)) +
  geom_ribbon(fill = "lightgrey", color = "darkgrey", alpha = 0.50) +
  geom_line(aes(y = mean_price), color = "blue") +
  theme_minimal()

geom_area() is a special case of geom_ribbon(). To see this, let’s change the minimum value in the previous graph to ymin = 0 and the maximum value to ymax = mean_price:

diamonds_sum |> 
  ggplot(aes(
    x = carat, 
    ymin = 0, 
    ymax = mean_price)) +
  geom_ribbon(fill = "lightblue") +
  theme_minimal()

The result is equal to the one you would have gotten with geom_area():

diamonds_sum |> 
  ggplot(aes(
    x = carat, 
    y = mean_price)) +
  geom_area(fill = "lightblue") +
  theme_minimal()

Because of these similarities, it shouldn’t come as a surprise that the arguments for geom_ribbon() and geom_area() are very similar:

geom_ribbon(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  ...,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  outline.type = "both"
)

With the exception of the defaults for stat and position, which differ in geom_area(), the arguments are very similar.

10.4.0.4 Bar or column geometries

There are two bar geometries: geom_bar() and geom_col(). By default, the first shows the number of observations in each category of the variable that is mapped on the x-axis. The second allows you to show the values of a variable mapped on the vertical axis for every value of the variable mapped on the x-axis. geom_bar() has one required aesthetic: x. The variable mapped on it is usually a discrete variable (nominal or ordinal).

10.4.0.4.1 geom_bar()

For geom_bar() the arguments are:

geom_bar(
  mapping = NULL,
  data = NULL,
  stat = "count",
  position = "stack",
  ...,
  just = 0.5,
  width = NULL,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE
)

Here the stat argument has the value "count". In other words, geom_bar() by default shows the number of observations on the vertical axis. This also means that there is only one required aesthetic: the variable whose values are counted. The alternative, "identity", shows the values of the variable mapped on the y-axis. Before moving on to the position argument, a short word about some of the others. The argument just positions the bar relative to the major grid line for the value on the x-axis. The default value, 0.5, centers the bars over that grid line. Alternative values are 0 or 1 to align left or right. The width argument, which is by default 0.90, measures how much of the space available for each category the bar takes up. By default this is 90%, which leaves a small gap between neighboring bars. If you reduce this to e.g. 0.75, each bar takes up 75% of the space available for its category.

The position argument with default "stack" matters in case the aesthetics include a mapping on e.g. fill or color. Mapping a variable on the fill aesthetic will show the number of observations for each value of the variable on the horizontal axis, but will differentiate between the values of the variable mapped on the fill aesthetic using different colors. The color aesthetic shows the same values, but does so using a different color for the outline of each subgroup. The position argument has two alternatives: "fill" and "dodge" (or "dodge2"). The first rescales the vertical axis to run from 0 to 1 (i.e. 0 to 100%) and shows the subcategories for each category on the x-axis as a share of the total for that category. The second, "dodge" or "dodge2", shows the subgroups next to each other instead of stacked.

To illustrate, we’ll use the diamonds dataset. To show the number of observations per level of cut, we map the cut variable on the x-axis and accept all defaults. Recall that geom_bar() will plot the count on the vertical axis. In other words, one aesthetic mapping, x = cut, is sufficient:

diamonds |>
  ggplot(aes(x = cut)) +
  geom_bar() +
  theme_minimal()
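
If you want to check what geom_bar() actually draws, you can ask {ggplot2} for the data of a layer after the stat has been applied, using layer_data(). The sketch below is a small illustration, with object names (p, bar_heights, manual_counts) chosen by us; it compares the bar heights with the counts you get from {dplyr}’s count().

```r
library(dplyr)
library(ggplot2)

p <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# layer_data() returns the data of a layer after the stat has been applied
bar_heights <- layer_data(p)$count

# the same counts, computed directly with dplyr
manual_counts <- diamonds |> count(cut)

all(bar_heights == manual_counts$n)
```

Both approaches count the rows per level of cut, so the comparison should hold for any discrete variable mapped on x.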

As you can see from the plot, geom_bar() adds the label “count” to the vertical axis. As cut is an ordered factor, the values on the x-axis are shown in that order. As usual, you can change the settings (i.e. layout options) such as the fill (fill of the bar), the color (color of the line around a bar), … by specifying those in geom_bar():

diamonds |>
  ggplot(aes(x = cut)) +
  geom_bar(fill = "lightblue", color = "blue", linewidth = 1) +
  theme_minimal()

Suppose that you want to know the number of observations for each cut-clarity combination. To do so, you have to map the variable clarity on an aesthetic. You can use the fill aesthetic. Doing so, geom_bar() will show the number of observations per level of cut and will use different colors to fill the bars, where each color shows one value of clarity:

diamonds |> 
  ggplot(aes(x = cut, fill = clarity)) +
  geom_bar() +
  theme_minimal()

There are other aesthetics that you can use to map another discrete variable, e.g. color or linewidth. The color aesthetic shows the subcategories using a different color for the outline:

diamonds |> 
  ggplot(aes(x = cut, color = clarity)) +
  geom_bar() +
  theme_minimal()

As you can see, except when you set the fill of the bars to white, the different lines reveal little on their own. The same holds for the other aesthetic mappings. In other words, the preferred aesthetic to map other variables is fill.

By default, geom_bar() stacks (position = "stack") the values of the variable mapped on the fill aesthetic. To show them next to one another, you add position = "dodge" or position = "dodge2":

diamonds |> 
  ggplot(aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge") +
  theme_minimal()

For every level of cut, geom_bar() now shows 8 bars: one per level of clarity. If you use position = "dodge2", you can add further details using position_dodge2():

position_dodge2(
  width = NULL,
  preserve = "total",
  padding = 0.1,
  reverse = FALSE
)

If you use position = "dodge", you can specify the width and preserve arguments. The first, width, is relevant if you have different geometries with different widths, for instance a point geometry and a bar geometry. The second, preserve, is relevant in case not all values of the variable mapped on the x-axis have the same number of subcategories for the variable mapped on the fill aesthetic. By default, R preserves the total width of all bars for each value on the horizontal axis. The alternative, "single", preserves the width of the individual bars. To illustrate, here are two examples taken from Chang (2025):

Figure 10.8: the preserve argument
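
To see the effect of preserve in numbers, the following sketch uses a small made-up data frame (toy, with hypothetical columns x and sub) in which one category has two subcategories and the other only one. The helper bar_width() is our own and simply reads the drawn bar widths from layer_data().

```r
library(ggplot2)

# hypothetical toy data: "a" has two subcategories, "b" only one
toy <- data.frame(
  x = c("a", "a", "b"),
  sub = c("s1", "s2", "s1"))

p_total <- ggplot(toy, aes(x = x, fill = sub)) +
  geom_bar(position = position_dodge2(preserve = "total"))

p_single <- ggplot(toy, aes(x = x, fill = sub)) +
  geom_bar(position = position_dodge2(preserve = "single"))

# width of every bar as drawn: xmax - xmin
bar_width <- function(p) {
  d <- layer_data(p)
  round(d$xmax - d$xmin, 6)
}

bar_width(p_total)   # the lone bar in "b" is wider than the bars in "a"
bar_width(p_single)  # all bars have the same width
```

With preserve = "total" the single bar in category b fills the space that two bars share in category a; with preserve = "single" every bar keeps the same width.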

The padding argument in position_dodge2() allows you to specify the distance between two bars. The default value is 0.1. If you increase that value, the distance between two bars at the same x position widens:

diamonds |> 
  ggplot(aes(x = cut, fill = clarity)) +
  geom_bar(position = position_dodge2(padding = 0.3)) +
  theme_minimal()

A negative value causes overlap:

diamonds |> 
  ggplot(aes(x = cut, fill = clarity)) +
  geom_bar(position = position_dodge2(padding = -0.3)) +
  theme_minimal()

The argument reverse = FALSE keeps the order of the subcategories. Changing this into TRUE reverses the order of the subcategories:

diamonds |> 
  ggplot(aes(x = cut, fill = clarity)) +
  geom_bar(position = position_dodge2(padding = -0.3, reverse = TRUE)) +
  theme_minimal()

Using position = "fill", geom_bar() will show proportions. For every value of the variable mapped on the horizontal axis, the total is set equal to 1 (or 100%). Each subcategory is then shown as its share in the total number of observations for its category on the x-axis:

diamonds |> 
  ggplot(aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill") +
  theme_minimal()
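
The shares that position = "fill" shows can also be computed by hand. The sketch below is one way to do so with {dplyr}; the names shares and share are our own. It counts the observations per cut-clarity combination and divides each count by the total for its cut.

```r
library(dplyr)
library(ggplot2)

shares <- diamonds |>
  count(cut, clarity) |>
  group_by(cut) |>
  mutate(share = n / sum(n)) |>  # share of each clarity within its cut
  ungroup()

# the shares add up to 1 within each cut, so stacking them
# produces bars of equal height
shares |>
  ggplot(aes(x = cut, y = share, fill = clarity)) +
  geom_col() +
  theme_minimal()
```

Having the shares in a data frame is convenient if you also want to report them in a table or add them as labels.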

geom_col()

geom_col() allows you to show values for the variable mapped on the vertical axis for every value of the variable mapped on the horizontal axis. This geometry needs at least two aesthetics: a discrete variable to map on the x-axis and the variable to map on the y-axis. This geometry allows you to show, e.g. the average of a variable (e.g. price), for every value of a discrete variable (e.g. cut). Here, we’ll use this geometry to visualize summary data.

The arguments of this function are very similar to the arguments for the geom_bar() geometry:

geom_col(
  mapping = NULL,
  data = NULL,
  position = "stack",
  ...,
  just = 0.5,
  width = NULL,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)

Suppose you want to visualize the average price per cut. Using {dplyr}’s group_by() and summarize() functions, you can calculate these averages using diamonds |> group_by(cut) |> summarize(ave_price = mean(price, na.rm = TRUE)) (see Chapter 8). Because these functions return a data frame, we can connect their output in a pipe with ggplot(). In the next code, the first two lines calculate the average price per cut. We map cut, the discrete variable, on the horizontal axis and the average price on the vertical axis. Note that the names of these variables are found in the tibble returned by the summarize() function. This data frame is used by ggplot() to find the variables. Here, we accept all default values.

diamonds |> group_by(cut) |>
  summarize(ave_price = mean(price, na.rm = TRUE)) |>
  ggplot(aes(x = cut, y = ave_price)) +
  geom_col() +
  theme_minimal()

For every value of cut, the variable mapped on the horizontal axis, geom_col() shows the variable ave_price, the mean price of the diamonds in each category of cut, on the vertical axis. Note that here ggplot() uses the data frame returned by {dplyr}’s functions as the data.

Using the other aesthetics, you can add other categories. As was the case with geom_bar(), usually only the fill aesthetic is used. Other aesthetics are not visually appealing to differentiate across categories. Here, we’ll show the average price for every combination of clarity and cut. To do so, we first need to calculate these values. Again, we do so using group_by() and summarize(). We map cut on the horizontal axis and use the fill aesthetic to map the variable clarity:

diamonds |> group_by(cut, clarity) |>
  summarize(ave_price = mean(price, na.rm = TRUE)) |>
  ggplot(aes(x = cut, y = ave_price, fill = clarity)) +
  geom_col() +
  theme_minimal()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.

Note that the output doesn’t make sense. The previous graph showed that the average price for fair cut diamonds was a little over 4000. Here, the bar reaches over 30000. This result is due to the fact that geom_col() by default stacks the values. In this case, this is not how it should be done. Stacking values per category is only relevant if the sum of these values is meaningful. Here, this is not the case. Changing the position to "dodge2" shows the averages next to each other, grouped by level of cut. The position dodge or dodge2 can be modified in the same way as in geom_bar().

diamonds |> group_by(cut, clarity) |>
  summarize(ave_price = mean(price, na.rm = TRUE)) |>
  ggplot(aes(x = cut, y = ave_price, fill = clarity)) +
  geom_col(position = "dodge2") +
  theme_minimal()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
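
To make the stacking problem concrete, the sketch below computes the height that a stacked bar would have (the sum of the eight averages within a cut) and compares it with the actual average price per cut. The object names are our own. It also uses summarize()’s .groups argument, which the message above refers to; setting .groups = "drop" returns an ungrouped data frame and silences that message.

```r
library(dplyr)
library(ggplot2)

ave_by_cut_clarity <- diamonds |>
  group_by(cut, clarity) |>
  summarize(ave_price = mean(price), .groups = "drop")

# height of each stacked bar: the sum of the averages within a cut
stacked_height <- ave_by_cut_clarity |>
  group_by(cut) |>
  summarize(stacked = sum(ave_price))

# the average price per cut, for comparison
ave_by_cut <- diamonds |>
  group_by(cut) |>
  summarize(ave_price = mean(price))

stacked_height
ave_by_cut
```

The stacked heights are several times larger than the averages per cut, which is why the stacked chart is misleading for this kind of summary.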

10.4.0.4.2 Using geom_bar() as geom_col() and vice versa

Both geometries are closely related. In the arguments for geom_bar(), there was stat = "count". Changing this into stat = "identity" tells this geometry to plot values, not counts. Using this value, you can create a bar chart using geom_bar() that is identical to the one created with geom_col(). To see this, let’s use geom_bar() instead of geom_col() in the previous plot and use the stat = "identity" argument:

diamonds |> group_by(cut, clarity) |>
  summarize(ave_price = mean(price, na.rm = TRUE)) |>
  ggplot(aes(x = cut, y = ave_price, fill = clarity)) +
  geom_bar(stat = "identity", position = "dodge2") +
  theme_minimal()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.

Because we used stat = "identity", geom_bar() now plots the values of the variable mapped on the vertical axis as values on that axis. In other words, using this argument, you can often use geom_bar() as geom_col().

geom_bar() shows counts. However, you can also use geom_col() to show counts. Let’s use this graph created with geom_bar():

diamonds |>
  ggplot(aes(x = cut)) +
  geom_bar() +
  theme_minimal()

To create the same chart with geom_col() we need a data frame that includes the number of observations per value of cut. Using group_by() and summarize(n = n()), this is what we can do. Using the result of these functions in ggplot() with geom_col() shows a geom_bar() type of output:

diamonds |> 
  group_by(cut) |> 
  summarize(n = n()) |>
  ggplot(aes(x = cut, y = n)) +
  geom_col() +
  theme_minimal()

10.4.0.4.3 Coordinate flip

We already touched upon coordinates. There, I illustrated the coordinate flip, coord_flip(). For bar and column charts, the effect is that the variable mapped on the horizontal axis is shown on the vertical axis and the values shown on the vertical axis are measured along the horizontal axis. Note that you don’t have to change the aesthetic mappings. It is sufficient to add coord_flip():

diamonds |>
  ggplot(aes(x = cut)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()

You would have had the same result if you mapped the discrete variable on the vertical axis:

diamonds |>
  ggplot(aes(y = cut)) +
  geom_bar() +
  theme_minimal()

However, the advantage of coord_flip() is that it doesn’t change which scale belongs to which variable. As the discrete variable is still mapped on the x-axis and is only shown on the y-axis, scale_x_discrete() … still refers to the right variable. You can also flip coordinates with geom_col() and with more than one category, here with the order of the subcategories reversed:

diamonds |> group_by(cut, clarity) |>
  summarize(ave_price = mean(price, na.rm = TRUE)) |>
  ggplot(aes(x = cut, y = ave_price, fill = clarity)) +
  geom_col(position = position_dodge2(padding = -0.3, reverse = TRUE)) +
  coord_flip() +
  theme_minimal()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.

10.4.0.4.4 Adding text

Sometimes it is useful to add values as text to a bar or column chart. Recall that you can add text using, e.g. geom_text() or geom_label(). Adding this text layer to a bar or column plot allows you to add e.g. the values they represent. Here, the values are added in white to a column chart in light blue. To do so, add geom_text() and map the count n to the label aesthetic. This geometry inherits the x and y mappings. In other words, geom_text() knows the values on the x-axis and the height along the y-axis. The values are nudged down 200 units. As the units on the vertical axis run from 0 to over 20000, a nudge of 200 down brings these values within the bars. This might require a bit of experimenting to get right. A positive nudge would place them above the bars.

diamonds |> 
  group_by(cut) |> 
  summarize(n = n()) |>
  ggplot(aes(x = cut, y = n)) +
  geom_col(fill = "lightsteelblue") +
  geom_text(aes(label = n), nudge_y = -200, color = "white") +
  theme_minimal()
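
An alternative to nudging in data units is the vjust setting of geom_text(), which shifts the label relative to its own height and therefore does not depend on the range of the vertical axis. The sketch below is a possible variant of the plot above; the value 1.5 is our own choice and may need some tuning.

```r
library(dplyr)
library(ggplot2)

p <- diamonds |>
  count(cut) |>
  ggplot(aes(x = cut, y = n)) +
  geom_col(fill = "lightsteelblue") +
  # vjust > 1 places the label below its anchor point, the top of the bar
  geom_text(aes(label = n), vjust = 1.5, color = "white") +
  theme_minimal()

p
```

A negative value such as vjust = -0.5 would place the labels just above the bars instead.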

I refer to the section on point geometries for further possibilities with geom_text() and similar geometries.

10.4.0.5 Geometries for distributions

There are multiple ways to show the distribution of a variable. geom_histogram() and geom_density() visualize the full distribution. Using geom_boxplot() or geom_violin() you can show the distribution using summary statistics such as the median, quartiles and other quantiles, minimum and maximum. For two continuous variables, you can use geom_bin2d(), geom_density_2d() or geom_density_2d_filled(). Here we will not go into detail on all of these geometries. We’ll focus on those for one variable and show, as an illustration, how you can extend them to two variables.

10.4.0.5.1 geom_histogram() and geom_density()

geom_histogram() divides the full range of values of a variable into bins. For every bin, geom_histogram() shows the number of observations. This is similar to what geom_bar() does for a discrete variable. In other words, after binning the variable yourself, you would be able to use geom_bar() or geom_col() to generate this plot. The arguments of geom_histogram() are:

geom_histogram(
  mapping = NULL,
  data = NULL,
  stat = "bin",
  position = "stack",
  ...,
  binwidth = NULL,
  bins = NULL,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE
)

Most should be familiar. The binwidth argument allows you to specify the width of a bin, e.g. 50 or 25. bins allows you to set the number of bins. By default this value is 30. There is one other way to change the bins and that is to define them using breaks. Using this argument, you can include a vector with bin boundaries or use, e.g. seq() to generate them. There is only one required aesthetic: the variable to map on the x-axis. To illustrate this function, we’ll visualize the distribution of price in the diamonds dataset. Accepting all defaults:

diamonds |>
  ggplot(aes(x = price)) +
  geom_histogram() +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can change the settings of the plot in the usual way: change the fill, change the color of the lines, … . Experimenting with the number of bins is often a good idea. Doing so, you can see what number is best for the data. Let’s change the number of bins in three ways:

  • increase the number of bins from 30 to 50:
diamonds |>
  ggplot(aes(x = price)) +
  geom_histogram(bins = 50) +
  theme_minimal()

  • set the bin width to 250:
diamonds |>
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 250) +
  theme_minimal()

  • use seq() to set bins every 500:
diamonds |>
  ggplot(aes(x = price)) +
  geom_histogram(breaks = seq(from = 0, to = 20000, by = 500)) +
  theme_minimal()
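
To check how the breaks translate into bins, you can again inspect the layer data. The sketch below assumes, as in the example above, that the breaks cover the full range of price; the object names are our own.

```r
library(ggplot2)

breaks <- seq(from = 0, to = 20000, by = 500)

p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(breaks = breaks)

bins <- layer_data(p)

nrow(bins)       # number of bins: one fewer than the number of breaks
sum(bins$count)  # the counts add up to the number of diamonds
```

If the breaks did not cover the full range of the variable, the observations outside of them would not be counted, so the second number is a quick check on your choice of breaks.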

If you map another variable on e.g. the fill aesthetic, geom_histogram() will show the count for each level of that variable in every bin using different colors. As with the bar geometries, other aesthetics are much less suitable for this. Using the fill aesthetic to map cut, the histogram shows the observations for each level of cut within each bin:

diamonds |>
  ggplot(aes(x = price, fill = cut)) +
  geom_histogram() +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For two variables, geom_hex() extends geom_histogram(). Both variables are divided into bins (default 30) and the graph shows the number of observations per hexagonal bin. The function requires a mapping on the x-axis and one on the y-axis, and it needs the {hexbin} package to be installed. To illustrate, I’ll use a data frame with random draws from a bivariate normal distribution:

mat1 <- matrix(rnorm(2000, 0, 1), nrow = 1000, ncol = 2)
cov <- matrix(c(1, 0.5, 0.5, 1), nrow = 2, ncol = 2)
mat1 <- mat1 %*% cov
df_mat1 <- as.data.frame(mat1)
df_mat1 <- df_mat1 |> rename(var1 = V1, var2 = V2)
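
One detail of this simulation is easy to miss: multiplying standard normal draws by a matrix A gives draws with covariance A'A, so the 2 × 2 matrix in the code above acts as a mixing matrix and is not itself the covariance of var1 and var2. If you want draws whose covariance equals a given matrix, one common option is to multiply by its Cholesky factor. The sketch below illustrates this with hypothetical names (sigma, implied, z, draws) and an arbitrary seed; it is not part of the example that follows.

```r
set.seed(123)  # arbitrary seed, only to make the draws reproducible

sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2, ncol = 2)  # target covariance

# covariance implied by multiplying standard normal draws by sigma itself
implied <- t(sigma) %*% sigma

# chol() returns an upper triangular R with t(R) %*% R equal to sigma,
# so z %*% R has covariance sigma
z <- matrix(rnorm(2000), nrow = 1000, ncol = 2)
draws <- z %*% chol(sigma)

implied
cov(draws)  # close to sigma, up to sampling noise
```

For the plots in this section the distinction does not matter much: in both cases the two variables are positively correlated and centered around 0.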

With 50 bins, this is what the histograms of these two variables, var1 and var2, look like:

plothist1 <- df_mat1 |> 
  ggplot(aes(x = var1)) +
  geom_histogram(bins = 50) +
  theme_minimal()

plothist2 <- df_mat1 |> 
  ggplot(aes(x = var2)) +
  geom_histogram(bins = 50) +
  theme_minimal()

plothist1 + plothist2

Using geom_hex() you plot both in one plot. Here, with 50 bins in each direction (50 for var1 and 50 for var2, so roughly 2500 bins in total), the plot shows the number of observations in every bin:

df_mat1 |> 
  ggplot(aes(x = var1, y = var2)) +
  geom_hex(bins = 50) +
  theme_minimal()

The color scale here shows a higher number of observations with a lighter blue color. Consistent with the histograms for the individual series var1 and var2, geom_hex() shows higher counts in bins closer to 0.

geom_histogram() shows the count by default. You can change that into a density using after_stat() in the aesthetic mapping. With the mapping y = after_stat(density), geom_histogram() shows the estimated densities on the y-axis, scaled so that the total area of the bars equals 1. With 50 bins, this generates this plot:

diamonds |>
  ggplot(aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50) +
  theme_minimal()
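
A density histogram is scaled such that the areas of the bars add up to 1. The sketch below verifies this for the plot above, using the layer data; the names bins and total_area are our own.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50)

bins <- layer_data(p)

# area of a bar = height (density) times width; the areas add up to 1
total_area <- sum(bins$density * (bins$xmax - bins$xmin))
total_area
```

This also explains why the values on the vertical axis are so small: the bins are several hundred price units wide, so the heights have to be small for the total area to equal 1.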

To show a density, you can also use geom_density(). The arguments of this geometry are:

geom_density(
  mapping = NULL,
  data = NULL,
  stat = "density",
  position = "identity",
  ...,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  outline.type = "upper"
)

The default stat = "density" calculates a density estimate. If you prefer counts, you can map y = after_stat(count), which scales the density by the number of observations. By default, geom_density() uses a Gaussian kernel density estimator to smooth the estimates. For instance, the density estimate for the price of diamonds:

diamonds |>
  ggplot(aes(x = price)) +
  geom_density() +
  theme_minimal()
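
To get a feeling for what a Gaussian kernel density estimate is, you can compute one yourself with base R’s density() function and plot the result as a line. The sketch below does this with names of our own (kde, kde_df). The curve may differ slightly from the one drawn by geom_density() at the edges, because density() by default evaluates the estimate somewhat beyond the range of the data.

```r
library(ggplot2)

# base R kernel density estimate with a Gaussian kernel
kde <- density(diamonds$price, kernel = "gaussian")

# density() returns the evaluation points (x) and the estimates (y)
kde_df <- data.frame(price = kde$x, density = kde$y)

ggplot(kde_df, aes(x = price, y = density)) +
  geom_line() +
  theme_minimal()
```

The bw argument of density() sets the bandwidth of the kernel; a larger bandwidth gives a smoother curve.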

Using the fill aesthetic, you can show density plots for every value of the variable mapped on that aesthetic. For instance, to plot the price density for the various values of cut, you map the latter on the fill aesthetic:

diamonds |>
  ggplot(aes(x = price, fill = cut)) +
  geom_density() +
  theme_minimal()

The problem here is that the density plot for e.g. “Ideal” overlaps with the one for e.g. “Fair”. To show all density plots, you can add some transparency using the alpha setting. If, in addition, you map cut also on the color aesthetic, geom_density() will add lines at the top of each density:

diamonds |>
  ggplot(aes(x = price, fill = cut, color = cut)) +
  geom_density(alpha = 1/10, linewidth = 1) +
  theme_minimal()

To plot a density with 2 variables, you can either use geom_density2d() or geom_density2d_filled(). The first shows the bivariate density using contour plots, the second using filled contour plots. To illustrate, we’ll use the random data in df_mat1. The individual densities look like this:

plotmat1 <- df_mat1 |>
  ggplot(aes(x = var1)) +
  geom_density() +
  theme_minimal()

plotmat2 <- df_mat1 |>
  ggplot(aes(x = var2)) +
  geom_density() +
  theme_minimal()

plotmat1 + plotmat2 

Using geom_density2d(), the result is

df_mat1 |>
  ggplot(aes(x = var1, y = var2)) +
  geom_density2d() +
  theme_minimal()

The contour lines reveal higher densities the more they are enclosed by other contour lines. Here, consistent with the univariate densities, the density increases as the values for both var1 and var2 close in on 0. With geom_density2d_filled() the plot is filled:

df_mat1 |>
  ggplot(aes(x = var1, y = var2)) +
  geom_density2d_filled() +
  theme_minimal()

Here, the color reveals information on the density. For this plot, value pairs with a high density are drawn in yellow. As the color scale moves from yellow to blue, the density of value pairs in that range falls.

10.4.0.5.2 geom_boxplot() and geom_violin()

A boxplot shows the distribution of a variable using a box, whose lower and upper edges are the 1st and 3rd quartile, and two whiskers. By default, each whisker extends from the box to the most extreme observation that lies within 1.50 times the interquartile range (the 3rd minus the 1st quartile) of the box. In addition, the box shows the median. Observations beyond the whiskers are outliers and are drawn as individual points. You can adapt the way in which geom_boxplot() shows the outliers. To do so, you need to change the arguments of the function:

geom_boxplot(
  mapping = NULL,
  data = NULL,
  stat = "boxplot",
  position = "dodge2",
  ...,
  outliers = TRUE,
  outlier.colour = NULL,
  outlier.color = NULL,
  outlier.fill = NULL,
  outlier.shape = 19,
  outlier.size = 1.5,
  outlier.stroke = 0.5,
  outlier.alpha = NULL,
  notch = FALSE,
  notchwidth = 0.5,
  staplewidth = 0,
  varwidth = FALSE,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE
)

The stat = "boxplot" computes the summary statistics using default settings. If you want to change these, you add, e.g. coef = 1.75 to let the whiskers extend up to 1.75 times the IQR. The treatment of outliers is governed by the arguments starting with outlier. The first, outliers = TRUE, by default shows the outliers. Changing this into FALSE discards the outliers, and the axis limits then adapt to the box and whiskers only. If you want to hide the outliers but keep an axis that covers the full range of the data, use outlier.shape = NA instead. If outliers are shown, you can change their color, fill, shape, size, stroke and alpha setting. Here, I refer to the point geometries to see how these affect the plot. notch = FALSE makes a traditional boxplot. Setting this to TRUE, R will show a notched boxplot, where the notches allow you to compare the medians across groups: if the notches of two boxes do not overlap, this suggests that the medians differ. You can set the width of the notch using notchwidth. By default this value equals 0.50, i.e. the width at the notch is half of the box width. The staples mark the end of the whiskers. The staple width, which is by default 0, allows you to specify the width of these staples relative to the box. The last argument, varwidth = FALSE, allows you to let the width of the box vary. By default, this is not the case. If changed to TRUE, the width of the box will be proportional to the square root of the number of observations.

Let’s first draw a boxplot for one variable, price. We’ll map that variable on the y-axis:

diamonds |>
  ggplot(aes(y = price)) +
  geom_boxplot() +
  theme_minimal()
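
To see where the box and the whiskers come from, you can compute the five numbers by hand and compare them with the values that geom_boxplot() uses. The sketch below does this for price with the default coef of 1.5; the object names (q, iqr, whisker_low, whisker_high, manual, from_plot) are our own.

```r
library(ggplot2)

price <- diamonds$price

q <- quantile(price, probs = c(0.25, 0.50, 0.75))  # box edges and median
iqr <- q[[3]] - q[[1]]

# whiskers end at the most extreme observations within 1.5 * IQR of the box
whisker_low  <- min(price[price >= q[[1]] - 1.5 * iqr])
whisker_high <- max(price[price <= q[[3]] + 1.5 * iqr])

manual <- c(whisker_low, q[[1]], q[[2]], q[[3]], whisker_high)

# the same five numbers as computed by geom_boxplot()
p <- ggplot(diamonds, aes(y = price)) +
  geom_boxplot()
from_plot <- unlist(layer_data(p)[c("ymin", "lower", "middle", "upper", "ymax")])

manual
unname(from_plot)
```

Every price above whisker_high is drawn as an outlier, which explains the long column of points above the upper whisker.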

Note that the x-axis in this case has no real meaning, so you can actually leave it blank. Let’s add a notch with width 0.25, let the whiskers extend up to 2 times the IQR, change the way outliers are shown (color = red) and add a staple at the end of each whisker. Note that you can use additional settings to change, e.g. the color of the lines, the fill of the box, …

diamonds |>
  ggplot(aes(y = price)) +
  geom_boxplot(
    coef = 2, 
    notch = TRUE, 
    notchwidth = 0.25,
    outlier.color = "red",
    staplewidth = 0.10) +
  theme(
    axis.line.x = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    panel.background = element_rect(fill = "white"))

To compare distributions across values of another variable, we need to map this variable on the x aesthetic. If you map the same variable on the fill aesthetic as well, the color of the boxes will change with these values and the plot will also show a legend. Strictly speaking, this is not necessary, as the boxplot already shows the values of the variable mapped on the x-axis as labels. However, doing so, you can show a legend and remove the x-axis. To start, let’s map cut on the x-axis and price on the y-axis. With the exception of notch, which we’ll set to TRUE, all default values apply:

diamonds |>
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot(notch = TRUE) +
  theme_minimal()

geom_boxplot() now shows one boxplot per value of cut. The notches suggest that there is a significant difference between the median price of “Ideal” cut diamonds and the median price of “Premium” cut diamonds. Whether this is also the case for the values “Fair” and “Good” is not immediately visible as the notches seem to overlap. Using varwidth = TRUE we can make the width of each box depend on the square root of the number of observations:

diamonds |>
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot(notch = TRUE, varwidth = TRUE) +
  theme_minimal()

The width of each box is now proportional to the square root of the number of observations. With 1610 observations in “Fair” and 21551 in “Ideal”:

diamonds |> group_by(cut) |> summarize(n = n()) |> mutate(sqrtn = round(sqrt(n), 2))
# A tibble: 5 × 3
  cut           n sqrtn
  <ord>     <int> <dbl>
1 Fair       1610  40.1
2 Good       4906  70.0
3 Very Good 12082 110. 
4 Premium   13791 117. 
5 Ideal     21551 147. 

the width of the “Ideal” box is

\[ \frac{\sqrt{21551}}{\sqrt{1610}} \approx 3.66 \]

times the width of the “Fair” box.

To illustrate the use of the fill aesthetic and some other settings for the outliers, let’s map cut on the fill aesthetic, show the outliers in transparent red and add staples with a width equal to 25% of the box width:

diamonds |>
  ggplot(aes(x = cut, y = price, fill = cut)) +
  geom_boxplot(
    notch = TRUE, 
    varwidth = TRUE, 
    staplewidth = 0.25, 
    outlier.colour = "red", 
    outlier.alpha = 1/5) +
   theme(
    axis.line.x = element_blank(),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    panel.background = element_rect(fill = "white"))

Violin plots are closely related to boxplots, but reveal more information about the distribution of the variable. They are meant to show the distribution of one continuous variable per value of a discrete variable. The arguments in geom_violin() should be largely familiar. The argument draw_quantiles, which is NULL by default, allows you to add horizontal lines at the quantiles you specify. The scale argument determines how the violins are scaled. By default ("area"), all violins have the same area. Changing this default value into "count" creates violins whose areas are proportional to the number of observations, while "width" gives all violins the same maximum width:

geom_violin(
  mapping = NULL,
  data = NULL,
  stat = "ydensity",
  position = "dodge",
  ...,
  draw_quantiles = NULL,
  trim = TRUE,
  bounds = c(-Inf, Inf),
  scale = "area",
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE
)

Let’s use a violin plot to show the distribution of price for every value of cut. To do so, we map the variable price on the vertical axis and cut on the horizontal axis:

diamonds |>
  ggplot(aes(x = cut, y = price)) +
  geom_violin() +
  theme_minimal()

The violins for every value of cut show you where the observations are. The very thin upper part shows that there are almost no observations at high price levels. In the lower part of the price range, there seems to be more spread for “Fair” cut diamonds than for e.g. “Ideal” cut diamonds. For the latter, the distribution is very wide at the lowest part of the price range, suggesting that there are a lot of observations for these price-cut pairs. To see this in more detail, let’s add a sample of the observations of the diamonds dataset using a point geometry with a bit of jitter and transparency:

diamonds |>
  ggplot(aes(x = cut, y = price)) +
  geom_violin() +
  geom_point(
    data = diamonds |> slice_sample(prop = 0.10),
    aes(x = cut, y = price), 
    position = position_jitter(width = 0.20, height = 0.20), alpha = 1/5, 
    color = "red", 
    size = 0.5) +
  theme_minimal()

Here you see that indeed, there are a lot of observations in the lower price range for “Ideal” cut diamonds.

As is the case with geom_histogram(), you can map the same variable on both the x-axis and e.g. the fill aesthetic. Doing so, geom_violin() will fill the violins and add the colors to the legend. If you want to add quantiles, e.g. the 10th and 90th percentile and the first and third quartile, you can specify those using draw_quantiles = c(0.10, 0.25, 0.75, 0.90). Here, we use the color white to draw these lines. By default R uses black and this color wouldn’t show up in the “Fair” cut category:

diamonds |>
  ggplot(aes(x = cut, y = price, fill = cut)) +
  geom_violin(draw_quantiles = c(0.10, 0.25, 0.75, 0.90), color = "white") +
  theme_minimal()

10.4.0.6 Other geometries

{ggplot2} includes many other geometries. However, you can create most of these graphs with the geometries that we have covered here. To illustrate, geom_linerange(), geom_pointrange() and geom_errorbar() allow you to draw a line over a specific range, a line over a range with a dot in the middle, or a line with a bar at both ends. These geometries allow you to e.g. show the range of values in a dataset. Here, you can define the range using the mean and standard deviation, median and interquartile range, … . However, you can recreate all three geometries using geom_segment() and geom_point(). To illustrate, we’ll show the price range for the categories of clarity in the diamonds dataset using each of the two range geometries and the errorbar geometry, as well as the segment and point geometries. The range is always defined as one standard deviation below and above the mean.

  • geom_linerange()
plotlinerange1 <- diamonds |> group_by(clarity) |>
  summarize(
    ave_price = mean(price, na.rm = TRUE), 
    sd_price = sd(price, na.rm = TRUE)) |>
  ggplot(aes(x = clarity, y = ave_price, fill = clarity)) +
  geom_linerange(aes(ymin = ave_price - sd_price, ymax = ave_price + sd_price, color = clarity), linewidth = 2) +
  coord_flip() +
  labs(
    title = "geom_linerange()"
  ) +
  theme_minimal()

plotlinerange2 <- diamonds |> group_by(clarity) |>
  summarize(
    ave_price = mean(price, na.rm = TRUE), 
    sd_price = sd(price, na.rm = TRUE)) |>
  ggplot(aes(x = clarity, y = ave_price, fill = clarity)) +
  geom_segment(aes(y = ave_price - sd_price, yend = ave_price + sd_price, color = clarity), linewidth = 2) +
  coord_flip() +
  labs(
    title = "geom_segment()"
  ) +
  theme_minimal()

plotlinerange1 + plotlinerange2

  • geom_pointrange()
plotpointrange1 <- diamonds |> group_by(clarity) |>
  summarize(
    ave_price = mean(price, na.rm = TRUE), 
    sd_price = sd(price, na.rm = TRUE)) |>
  ggplot(aes(x = clarity)) +
  geom_pointrange(aes(y = ave_price, ymin = ave_price - sd_price, ymax = ave_price + sd_price, color = clarity), linewidth = 1, size = 2) +
  coord_flip() +
  labs(
    title = "geom_pointrange()"
  ) +
  theme_minimal()

plotpointrange2 <- diamonds |> group_by(clarity) |>
  summarize(
    ave_price = mean(price, na.rm = TRUE), 
    sd_price = sd(price, na.rm = TRUE)) |>
  ggplot(aes(x = clarity)) +
  geom_segment(aes(y = ave_price - sd_price, yend = ave_price + sd_price, color = clarity), linewidth = 1, show.legend = FALSE) +
  geom_point(aes(y = ave_price, color = clarity), size = 8, shape = 16) +
  coord_flip() +
  labs(
    title = "geom_segment() and geom_point()"
  ) +
  theme_minimal()

plotpointrange1 + plotpointrange2

  • geom_errorbar()
ploterror1 <- diamonds |> group_by(clarity) |>
  summarize(
    ave_price = mean(price, na.rm = TRUE), 
    sd_price = sd(price, na.rm = TRUE)) |>
  ggplot(aes(x = clarity)) +
  geom_errorbar(aes(ymin = ave_price - sd_price, ymax = ave_price + sd_price, color = clarity), linewidth = 2) +
  coord_flip() +
  labs(
    title = "geom_errorbar()"
  ) +
  theme_minimal()

ploterror2 <- diamonds |> group_by(clarity) |>
  summarize(
    ave_price = mean(price, na.rm = TRUE), 
    sd_price = sd(price, na.rm = TRUE)) |>
  ggplot(aes(x = clarity, y = ave_price, fill = clarity)) +
  geom_segment(aes(y = ave_price - sd_price, yend = ave_price + sd_price, color = clarity), linewidth = 2) +
  geom_point(aes(y = ave_price - sd_price, color = clarity), shape = "\uFE31", size = 10, stroke = 2, show.legend = FALSE) +
  geom_point(aes(y = ave_price + sd_price, color = clarity), shape = "\uFE31", size = 10, stroke = 2, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "geom_segment() and geom_point()"
  ) +
  theme_minimal()

ploterror1 + ploterror2

Here, we used the Unicode character “\uFE31” (a vertical dash) as the point shape to cap both ends of each line segment.
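The same geometries work with other definitions of the range. As a minimal sketch, the code below uses the median and the first and third quartile instead of the mean and standard deviation. The names price_by_clarity, med_price, q1_price, q3_price and plot_iqr are only illustrative. The summary is stored once so that it could be reused for several plots.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Summarize once: median and quartiles of price per clarity category
price_by_clarity <- diamonds |>
  group_by(clarity) |>
  summarize(
    med_price = median(price, na.rm = TRUE),
    q1_price = quantile(price, 0.25, na.rm = TRUE, names = FALSE),
    q3_price = quantile(price, 0.75, na.rm = TRUE, names = FALSE)
  )

# Point at the median, line from the first to the third quartile
plot_iqr <- price_by_clarity |>
  ggplot(aes(x = clarity, color = clarity)) +
  geom_pointrange(
    aes(y = med_price, ymin = q1_price, ymax = q3_price),
    linewidth = 1
  ) +
  coord_flip() +
  labs(title = "Median and interquartile range") +
  theme_minimal()

plot_iqr
```

Because price is right-skewed, the median and quartiles can give a rather different picture than the mean plus or minus one standard deviation.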

In addition, there are many other packages that were designed with a specific application in mind. For instance, using {[ggwordcloud] (https://lepennec.github.io/ggwordcloud/index.html)} (Le Pennec and Slowikowski (2024)) you can build word clouds that show which words occur most often in a text by increasing the size of the font. {[treemapify] (https://wilkox.org/treemapify/index.html)} makes it easier to create treemaps and show, e.g. the relative importance of a product or market in total sales or exports. {[ggradar] (https://github.com/ricardo-bion/ggradar)} is very useful to design radar charts to compare how various products, countries or firms score on a fixed set of characteristics. {[ggbump] (https://github.com/davidsjoberg/ggbump)} can be used to create bump charts to visualize the change in ranking over time.

10.5 Annotations

Using geom_text(), geom_label(), geom_hline(), geom_vline(), geom_abline(), geom_segment() or geom_curve() you can add visual annotations to a graph: a label with the values for a bar chart, a horizontal or vertical line showing a specific date or an average, … . The annotate() function offers another way to add information to a plot. The main arguments of the annotate() function are:

annotate(
  geom,
  x = NULL,
  y = NULL,
  xmin = NULL,
  xmax = NULL,
  ymin = NULL,
  ymax = NULL,
  xend = NULL,
  yend = NULL,
  ...,
  na.rm = FALSE
)

The first argument, geom, refers to the geometry to use, e.g. a text, rectangle, line segment or point range. Each geometry comes with its own position arguments:

  • “point” for a point: x, y
  • “text” for a text annotation: x, y, label
  • “rect” for a rectangle: xmin, xmax, ymin, ymax
  • “segment” to add a line segment: x, y, xend, yend
  • “pointrange” to add a point range: x, y, ymin, ymax
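As a minimal sketch of these argument sets, the code below adds one annotation of each type to a scatter plot of the built-in mtcars dataset. The dataset, the coordinates, the colors and the object name p_annotated are arbitrary choices for illustration only.

```r
library(ggplot2)

p_annotated <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "grey60") +
  # "point": x, y
  annotate("point", x = 5, y = 30, color = "red", size = 3) +
  # "text": x, y, label
  annotate("text", x = 5, y = 31.5, label = "a point") +
  # "rect": xmin, xmax, ymin, ymax
  annotate("rect", xmin = 3, xmax = 4, ymin = 10, ymax = 20,
           fill = "steelblue", alpha = 1/5) +
  # "segment": x, y, xend, yend
  annotate("segment", x = 2, y = 12, xend = 3, yend = 15,
           linetype = "dashed") +
  # "pointrange": x, y, ymin, ymax
  annotate("pointrange", x = 4.5, y = 25, ymin = 22, ymax = 28,
           color = "sienna4") +
  theme_minimal()

p_annotated
```

Each annotate() call adds a separate layer with fixed values, so none of them needs a data frame or an aes() mapping.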

Every geometry needs a minimum set of values: a point needs a position (x, y); a text needs a position (x, y) and the text to add (label); a rectangle needs its four edges (xmin, xmax, ymin, ymax); a line segment needs a start and an end position, defined by (x, y) and (xend, yend); and a pointrange, a vertical line with a midpoint, needs an x value, the values where the line starts and ends (ymin and ymax) as well as the position of the point (y). For all these geometries, you can add further settings in line with that geometry. For instance, for a text, you can set the font family, fontface, size or the horizontal and vertical alignment (hjust: 0 = left, 0.5 = center, 1 = right; vjust: 0 = bottom, 0.5 = middle, 1 = top). For annotations that include lines, you can set the line type or width, and a rectangle can be filled. To illustrate these annotations, we’ll use the vacancy rate in the Beveridge dataset. If you haven’t imported it yet, you can do so here:

data_beveridge <- read_csv(here::here("data", "raw", "data_beveridge.csv")) 
data_beveridge <- data_beveridge |> rename(date = DATE, quarter = `TIME PERIOD`)

Here, we’ll map the date on the horizontal axis and the vacancy rate on the vertical axis and use a simple line geometry to show the data. In addition, we use annotate() to add 3 rectangles: one to highlight the financial crisis, one to highlight the Euro area debt crisis and one to highlight the pandemic. We’ll fill these rectangles with different colors. We’ll also add a text annotation that names each crisis, align this text with the left edge of the rectangle and use the font family “serif”. To set the upper limit of the rectangles, we first store the maximum value of the vacancy rate. The full code for this plot is:

# Highest vacancy rate, used as the upper edge of the rectangles
ym <- max(data_beveridge$vacancy_rate, na.rm = TRUE)

data_beveridge |> select(date, quarter, vacancy_rate) |>
  ggplot(aes(x = date, y = vacancy_rate)) +
  geom_line() +
  annotate("rect", 
           xmin = as.Date("2008-03-31"), 
           xmax = as.Date("2009-08-31"), 
           ymin = 0, 
           ymax = ym, 
           fill = "lightgrey", 
           alpha = 1/5) +
  annotate("text", 
           x = as.Date("2008-03-31"),
           y = ym - 0.5, 
           label = "Financial crisis", 
           color = "darkgrey", 
           hjust = 0, 
           family = "serif") +
  annotate("rect", 
           xmin = as.Date("2011-06-30"), 
           xmax = as.Date("2013-03-31"), 
           ymin = 0, 
           ymax = ym, 
           fill = "lightsteelblue1", 
           alpha = 1/5) +
  annotate("text", 
           x = as.Date("2011-06-30"),
           y = ym - 0.5, 
           label = "Euro area debt crisis", 
           color = "steelblue", 
           hjust = 0, 
           family = "serif") +
  annotate("rect", 
           xmin = as.Date("2020-01-31"), 
           xmax = as.Date("2021-09-30"), 
           ymin = 0, 
           ymax = ym, 
           fill = "sienna2", 
           alpha = 1/5) +
  annotate("text", 
           x = as.Date("2020-01-31"),
           y = ym - 0.5, 
           label = "Covid pandemic", 
           color = "sienna4", 
           hjust = 0,
           family = "serif") +
  theme_minimal()
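When the number of highlighted periods grows, repeating annotate() becomes verbose. An alternative, sketched below under a few assumptions, is to store the periods in a small data frame and draw them with geom_rect() and geom_text(). Because the Beveridge data is a local file, this sketch uses the economics dataset that ships with {ggplot2} (median unemployment duration, uempmed) as a stand-in. It therefore only includes the two periods that fall within that dataset, which ends in 2015. The names crises and p_crises are hypothetical.

```r
library(ggplot2)  # provides the economics dataset (stand-in for the Beveridge data)
library(dplyr)

# One row per highlighted period; dates, labels and fill colors as in the plot above
crises <- data.frame(
  label = c("Financial crisis", "Euro area debt crisis"),
  start = as.Date(c("2008-03-31", "2011-06-30")),
  end   = as.Date(c("2009-08-31", "2013-03-31")),
  fill  = c("lightgrey", "lightsteelblue1")
)

p_crises <- economics |>
  filter(date >= as.Date("2005-01-01")) |>
  ggplot(aes(x = date, y = uempmed)) +
  # Rectangles spanning the full height of the panel (-Inf to Inf)
  geom_rect(
    data = crises,
    aes(xmin = start, xmax = end, fill = fill),
    ymin = -Inf, ymax = Inf,
    alpha = 1/5,
    inherit.aes = FALSE
  ) +
  # Use the color names stored in the fill column as they are
  scale_fill_identity() +
  # Labels at the top of the panel, left-aligned with each rectangle
  geom_text(
    data = crises,
    aes(x = start, label = label),
    y = Inf, hjust = 0, vjust = 1.5,
    family = "serif",
    inherit.aes = FALSE
  ) +
  geom_line() +
  theme_minimal()

p_crises
```

Using -Inf and Inf for the vertical limits avoids having to compute the maximum of the series first. Adding a period then only requires an extra row in the data frame.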

Using annotations can add value to a plot. Recall for instance that we used such an annotation to highlight “Premium” cut diamonds in a point geometry.